Add osm tag file parsing #23

Merged
newsch merged 6 commits from osm-tags into main 2023-08-10 13:37:59 +00:00

6 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
b250dd4b13 Structure parse errors and only log warning if above threshold
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors now cause a failure instead of being logged.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:09:23 -04:00
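
The threshold rule from commit b250dd4b13 could be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `ParseTagError`, `exceeds_threshold` and the two threshold constants are hypothetical, and the error type is written with manual `Display`/`Error` impls so the sketch stays std-only, whereas the commit derives its error types with `thiserror`.

```rust
use std::fmt;

/// Hypothetical error type for tags that could not be parsed.
/// (The commit uses `thiserror`; manual impls are used here instead.)
#[derive(Debug, PartialEq)]
pub enum ParseTagError {
    InvalidQid(String),
    InvalidTitle(String),
}

impl fmt::Display for ParseTagError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidQid(s) => write!(f, "invalid QID: {s:?}"),
            Self::InvalidTitle(s) => write!(f, "invalid title: {s:?}"),
        }
    }
}

impl std::error::Error for ParseTagError {}

/// 0.02% written as the ratio 2 / 10_000 (assumed representation).
const THRESHOLD_NUMERATOR: u128 = 2;
const THRESHOLD_DENOMINATOR: u128 = 10_000;

/// Returns true when `failed / total >= 0.02%`.
/// Integer arithmetic avoids floating-point rounding at the boundary.
pub fn exceeds_threshold(failed: usize, total: usize) -> bool {
    if total == 0 {
        return false;
    }
    (failed as u128) * THRESHOLD_DENOMINATOR >= (total as u128) * THRESHOLD_NUMERATOR
}
```

With this shape, a caller would collect the errors while parsing, log each one at `debug` level, and emit the single higher-level message only when `exceeds_threshold` returns true.
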
Evan Lloyd New-Schmidt
29cdbe2301 Refactor and rename title/qid wrappers
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:08:14 -04:00
Evan Lloyd New-Schmidt
2532d1365e Improve URL handling
- Check for URLs in OSM tags
- Handle mobile URLs

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:07:43 -04:00
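
As an illustration of what the URL handling in commit 2532d1365e might involve, the sketch below accepts either a `lang:Title` tag value or a full Wikipedia URL, including the mobile `*.m.wikipedia.org` form. The function name `parse_wikipedia_tag` is hypothetical, and the behaviour shown (underscores mapped to spaces, no percent-decoding, no query-string handling) is an assumption, not taken from the commit.

```rust
/// Hypothetical helper: split a `wikipedia` tag value into (language, title).
/// Accepts "lang:Title" as well as desktop and mobile Wikipedia URLs.
pub fn parse_wikipedia_tag(value: &str) -> Option<(String, String)> {
    let value = value.trim();

    // URL form: strip the scheme, then split host from path.
    let rest = value
        .strip_prefix("https://")
        .or_else(|| value.strip_prefix("http://"));
    if let Some(rest) = rest {
        let (host, path) = rest.split_once('/')?;
        // Try the mobile suffix first so "en.m.wikipedia.org" yields "en".
        let lang = host
            .strip_suffix(".m.wikipedia.org")
            .or_else(|| host.strip_suffix(".wikipedia.org"))?;
        let title = path.strip_prefix("wiki/")?;
        if lang.is_empty() || title.is_empty() {
            return None;
        }
        // Assumption: underscores in URL titles stand for spaces.
        return Some((lang.to_string(), title.replace('_', " ")));
    }

    // Plain tag form: "lang:Title".
    let (lang, title) = value.split_once(':')?;
    if lang.is_empty() || title.is_empty() {
        return None;
    }
    Some((lang.to_string(), title.to_string()))
}
```
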
Evan Lloyd New-Schmidt
3d48c39793 Extract tags in parallel in Rust
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:04:27 -04:00
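
The general shape of the parallel extraction in commit 3d48c39793 can be sketched with the standard library alone. Note the substitution: the commit uses the `rayon` and `osmpbf` crates, while this sketch uses `std::thread::scope` over an in-memory slice so that it has no dependencies. The `Tags` type, the function names and the two-column `wikidata<TAB>wikipedia` layout are all assumptions made for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for the key/value tags of one OSM element.
pub type Tags = Vec<(String, String)>;

/// Look up a tag value, returning "" when the key is absent.
fn get<'a>(tags: &'a Tags, key: &str) -> &'a str {
    tags.iter()
        .find(|(k, _)| k.as_str() == key)
        .map(|(_, v)| v.as_str())
        .unwrap_or("")
}

/// Produce one TSV line per element that carries either tag.
fn tsv_line(tags: &Tags) -> Option<String> {
    let (wikidata, wikipedia) = (get(tags, "wikidata"), get(tags, "wikipedia"));
    if wikidata.is_empty() && wikipedia.is_empty() {
        return None;
    }
    Some(format!("{wikidata}\t{wikipedia}"))
}

/// Split the elements into at most `procs` chunks and process each chunk
/// on its own thread; results are returned in the original order.
pub fn extract_parallel(elements: &[Tags], procs: usize) -> Vec<String> {
    let procs = procs.max(1);
    // Ceiling division; `chunks` panics on a size of 0, hence the `max(1)`.
    let chunk_size = ((elements.len() + procs - 1) / procs).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = elements
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().filter_map(tsv_line).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Here `procs` plays the role the commit message gives to the `--procs` flag: it bounds the number of worker threads.
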
Evan Lloyd New-Schmidt
0ac935c175 Refactor into subcommands
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:02:16 -04:00
Evan Lloyd New-Schmidt
a2c113a885 Add new option to parse osm tag file
Parse `wikipedia` and `wikidata` tags from a TSV file of OSM tags,
compatible with the `--csv` output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-09 16:01:35 -04:00
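
To make the tag-file parsing in commit a2c113a885 concrete, here is a minimal sketch of reading one line of such a TSV file. The column order (`wikidata` first, then `wikipedia`), the `TagRow` struct and the function names are assumptions for illustration. The real column layout depends on how `osmconvert` was invoked.

```rust
/// Hypothetical parsed row: a numeric QID and the raw `wikipedia` tag value.
#[derive(Debug, PartialEq)]
pub struct TagRow {
    pub qid: Option<u32>,
    pub wikipedia: Option<String>,
}

/// Parse a Wikidata QID such as "Q42" into its numeric part.
pub fn parse_qid(s: &str) -> Option<u32> {
    let s = s.trim();
    let digits = s.strip_prefix('Q').or_else(|| s.strip_prefix('q'))?;
    // Reject empty input and signs: `str::parse::<u32>` would accept "+5".
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Parse one tab-separated line with the assumed columns
/// `wikidata<TAB>wikipedia`. Returns None if a column is missing.
pub fn parse_tsv_line(line: &str) -> Option<TagRow> {
    let mut cols = line.split('\t');
    let wikidata = cols.next()?;
    let wikipedia = cols.next()?.trim();
    Some(TagRow {
        qid: parse_qid(wikidata),
        wikipedia: if wikipedia.is_empty() {
            None
        } else {
            Some(wikipedia.to_string())
        },
    })
}
```
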