Add osm tag file parsing #23
Reference: organicmaps/wikiparser#23
Parse wikipedia and wikidata tags from a TSV file of OSM tags, compatible with the `--csv` output of `osmconvert`. Closes #19.
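The line-level parsing the PR describes could look like the hedged sketch below. This is an illustration, not the PR's code: the column order here is an assumption, since the real layout depends on which `--csv` arguments are passed to `osmconvert`.

```rust
/// Pull the wikidata/wikipedia columns out of one osmconvert-style TSV line.
/// Hypothetical helper; column order is an assumption for this sketch.
fn parse_line(line: &str) -> (Option<&str>, Option<&str>) {
    let mut fields = line.split('\t');
    let _id = fields.next(); // object id column
    // Empty columns mean the object has no such tag.
    let wikidata = fields.next().filter(|s| !s.is_empty());
    let wikipedia = fields.next().filter(|s| !s.is_empty());
    (wikidata, wikipedia)
}

fn main() {
    let line = "1648348163\tQ1866088\tnl:Warfstermolen";
    assert_eq!(
        parse_line(line),
        (Some("Q1866088"), Some("nl:Warfstermolen"))
    );
    // Objects without either tag produce empty columns.
    assert_eq!(parse_line("42\t\t"), (None, None));
}
```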
Notes from parsing all planet wikipedia/wikidata tags:
- `;` / `း` used instead of `:`
- multiple values joined with `;`, e.g. `Q123;Q124`
- a trailing `(Q123)` qualifier in titles, e.g. `Warfstermolen (Q1866088)`
- There are 50 wikipedia entries with URL escaping; some are URLs instead of titles and are not handled correctly.
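Handling the multi-value entries noted above could look like this hedged sketch (hypothetical helper name, not the parser in om_wikiparser): split on `;` and keep only well-formed QIDs, dropping garbage entries instead of aborting the whole parse.

```rust
/// Split a wikidata tag value on ';' and keep only well-formed QIDs
/// (a 'Q' followed by one or more ASCII digits). Hypothetical helper.
fn parse_qids(value: &str) -> Vec<&str> {
    value
        .split(';')
        .map(str::trim)
        .filter(|s| {
            let mut chars = s.chars();
            chars.next() == Some('Q')
                && chars.clone().count() > 0
                && chars.all(|c| c.is_ascii_digit())
        })
        .collect()
}

fn main() {
    assert_eq!(parse_qids("Q123;Q124"), vec!["Q123", "Q124"]);
    assert_eq!(parse_qids("Q1866088"), vec!["Q1866088"]);
    // Malformed entries are skipped rather than failing the line.
    assert_eq!(parse_qids("not-a-qid;Q42"), vec!["Q42"]);
}
```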
Remaining work:
- serialize parse errors to disk for changes; add structured parse errors (see #25 for the rest), log a summary
- use the `osmpbf` crate to parse the planet file, to fix the `osmconvert` truncation problem
- update `run.sh` to use the new method

Is there invalid UTF-8 in OpenStreetMap.org? Any examples? Can it be fixed?
It would be great to fix errors in OpenStreetMap.org directly.
On the other hand, it is important to log errors with OpenStreetMap.org data and report them to osmers or fix ourselves, while producing robust output.
I'm not sure yet if it's in the OSM db or something that got messed up by `osmium`/`osmconvert`, and I can't copy/paste them here, but I attached the relevant lines: planet-wikipedia-utf8-errors.csv
I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes, others might require someone with local knowledge that we can leave notes on.
Thanks! Could it be some local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be some osmium issue.
There is nothing wrong with the output:
https://www.openstreetmap.org/api/0.6/node/1648348163
https://www.openstreetmap.org/node/1648348163
https://www.openstreetmap.org/api/0.6/way/53580264
https://www.openstreetmap.org/api/0.6/relation/7032757
My terminal is set to `en_US.UTF-8`, but it went straight from `osmium` -> `osmconvert` -> file.

I have an outdated planet file, I'll extract the full objects from there and see.
`osmium` outputs it fine, but `osmconvert` doesn't.

I think it's a truncation issue: they're all around 257 bytes long, the error is at the end, and `osmconvert` has a fixed buffer of length 256 for keys/values.

It looks like these aren't titles but notes, or text copied from the articles? Wikipedia has a limit of 255 characters in titles.
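The buffer theory is easy to reproduce: cutting a UTF-8 string at a fixed byte count can land mid-codepoint, which produces exactly the kind of invalid sequence in the attached file. A small illustration (not osmconvert's actual code):

```rust
// Truncate a string to a fixed byte budget, the way a fixed-size C buffer
// would: blindly, with no regard for UTF-8 character boundaries.
fn truncate_bytes(s: &str, max: usize) -> Vec<u8> {
    s.as_bytes().iter().take(max).copied().collect()
}

fn main() {
    // 'é' is 2 bytes in UTF-8; a 4-byte budget keeps b"abc" plus only
    // the first byte of 'é', splitting the codepoint.
    let value = "abcé";
    let cut = truncate_bytes(value, 4);
    assert!(std::str::from_utf8(&cut).is_err());

    // A UTF-8-aware truncation backs up to a char boundary instead.
    let mut end = 4;
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    assert_eq!(&value[..end], "abc");
}
```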
Good catch! You can patch osmconvert to increase the size, or try a native Rust approach.
The link above says 255 UTF-8 bytes in the title, not 255 characters.
You may try to osmupdate your planet (don't forget to drop authors and history).
Sorry, I misspoke - yes, 255 bytes of UTF-8. I'll try the Rust parser you used.
Such a big PR! It would be easier to do it in parts.
How does simplification output look now?
Can we run and test it?
```
@@ -4,22 +4,21 @@ use std::{collections::HashSet, str::FromStr};
extern crate om_wikiparser;
extern crate test;
```
Why is it not in one line?
A constant to avoid copy-paste?
```
@@ -23,3 +21,4 @@
let title = Title::from_url(TITLE).unwrap();
let mut set = HashSet::new();
b.iter(|| {
set.insert(&title);
```
A constant to avoid copy-paste?
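The hunk above builds a `Title` from a wikipedia URL. As a rough, hedged sketch of what such a parser might do (the names and behavior here are assumptions, not om_wikiparser's actual implementation):

```rust
// Hypothetical stand-in for a Title::from_url-style parser: extract the
// language and article title from a wikipedia URL, or None if the URL
// doesn't match the expected shape.
fn title_from_url(url: &str) -> Option<(String, String)> {
    let rest = url.strip_prefix("https://")?;
    let (host, path) = rest.split_once('/')?;
    let lang = host.strip_suffix(".wikipedia.org")?;
    let title = path.strip_prefix("wiki/")?;
    // Percent-decoding and further normalization omitted in this sketch.
    Some((lang.to_string(), title.replace('_', " ")))
}

fn main() {
    assert_eq!(
        title_from_url("https://nl.wikipedia.org/wiki/Warfstermolen"),
        Some(("nl".to_string(), "Warfstermolen".to_string()))
    );
    // Non-wikipedia hosts are rejected.
    assert_eq!(title_from_url("https://example.com/other"), None);
}
```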
I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together.
This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes.
Sure! I've been testing it as I go, and `run.sh` is updated to extract the tags from a pbf file. The new format is documented in the script and the README; the TL;DR is that now you pass a pbf file as well.

```
@@ -4,22 +4,21 @@ use std::{collections::HashSet, str::FromStr};
extern crate om_wikiparser;
extern crate test;
```
I'm not sure, renaming it to the shorter `Title` must have altered `rustfmt`'s heuristics.

I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message.
These contain meaningful changes:
These move things around and don't change meaningful functionality:
Thanks for clear commits and their descriptions!
```
@@ -0,0 +134,4 @@
.unwrap_or_default();
let matching_titles = if wikipedia_titles.is_empty() {
Default::default()
```
What is the benefit of hiding errors under a threshold? Isn't it beneficial to see all errors and be able to estimate/compare the quality of the dump, and to easily grep/find what is most important, or feed the whole log to contributors for fixes?
Does it make sense to print wrong hosts in a log to fix/support them?
ditto
They are caught at a higher level and logged/saved with the full string.
The threshold only determines whether the message is logged at `info` vs `error` level.

When you use the `run.sh` script with multiple languages it prints a copy of the hundreds of errors for each language. I think writing the parse errors to a separate file will be easier to read and deal with.

I'm open to other ideas.
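For context, the threshold behavior described above can be sketched like this (a hypothetical stand-in, not the PR's logging code): a small error count is worth surfacing loudly per-message, while past the threshold individual messages are demoted so the summary or error file carries the detail.

```rust
/// Log level chosen for an individual parse-error message.
#[derive(Debug, PartialEq)]
enum Level {
    Info,
    Error,
}

/// Below or at the threshold, each error is logged at `error` level;
/// past it, per-message output drops to `info`. Hypothetical helper.
fn level_for(error_count: usize, threshold: usize) -> Level {
    if error_count <= threshold {
        Level::Error
    } else {
        Level::Info
    }
}

fn main() {
    // A handful of errors: loud, so they get fixed.
    assert_eq!(level_for(3, 100), Level::Error);
    // Hundreds of errors per language: demoted to keep the log readable.
    assert_eq!(level_for(500, 100), Level::Info);
}
```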