- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.
OSM ids are not unique across nodes, ways, and relations (each object
type has its own id space), so the object type should be saved as well.
Including the edit version will make it easier to tell whether a
mis-tagged object is simply outdated.
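A minimal sketch of how the @otype column can be modeled when parsing (the names here are illustrative; osmconvert encodes @otype as a digit, 0 = node, 1 = way, 2 = relation):

```rust
/// OSM object type; ids are only unique within one of these.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObjType {
    Node,
    Way,
    Relation,
}

fn parse_otype(field: &str) -> Option<ObjType> {
    match field {
        "0" => Some(ObjType::Node),
        "1" => Some(ObjType::Way),
        "2" => Some(ObjType::Relation),
        _ => None,
    }
}
```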
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Write a TSV file with the line number, error, and input text (see the sketch below).
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.
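A minimal sketch of the row format, matching the columns listed above (the helper is hypothetical):

```rust
use std::io::Write;

/// One tab-separated row per failed input line.
fn write_error_row(
    out: &mut impl Write,
    line_number: usize,
    error: &str,
    input: &str,
) -> std::io::Result<()> {
    writeln!(out, "{line_number}\t{error}\t{input}")
}
```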
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing HTML to an `errors/` subdirectory (see the sketch below)
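A sketch of the approach with `std::panic::catch_unwind` (`process`, the error path, and the logging macro are assumptions; static `panic!` messages arrive as `&str` payloads, formatted ones as `String`):

```rust
use std::panic;
use std::path::Path;

fn process_caught(html: &str, error_path: &Path) -> anyhow::Result<Option<String>> {
    match panic::catch_unwind(|| process(html)) {
        Ok(output) => Ok(Some(output)),
        Err(payload) => {
            // `panic!("...")` yields &str, `panic!("{}", x)` yields String.
            let message = payload
                .downcast_ref::<&str>()
                .copied()
                .or_else(|| payload.downcast_ref::<String>().map(String::as_str))
                .unwrap_or("unknown panic");
            log::error!("panic while processing article: {message}");
            // Keep the offending input around for debugging.
            std::fs::write(error_path, html)?;
            Ok(None)
        }
    }
}
```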
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The caught HTML panics still print backtraces. Disabling that in Rust
would require swapping out the global panic hook when entering and
exiting the function.
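For reference, a sketch of what that swap would look like (the hook is process-global, so this is not thread-friendly):

```rust
use std::panic;

// Silence the default message + backtrace around the call, then restore it.
let previous_hook = panic::take_hook();
panic::set_hook(Box::new(|_info| {}));
let result = panic::catch_unwind(|| process(html));
panic::set_hook(previous_hook);
```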
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove `<head>` in the initial stage
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing (sketched below)
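A minimal sketch of the harness on the nightly `test` crate (the sample path and the `simplify` entry point are assumptions):

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn process_html(b: &mut Bencher) {
    // One article body from the dump, checked in next to the benches.
    let html = include_str!("../tests/data/article.html");
    b.iter(|| simplify(html));
}
```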
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add custom error types with the `thiserror` crate in preparation for #25 (see the sketch below).
- Parsing errors are captured instead of being logged at `warn` by default.
- All parsing errors are still logged at `debug` level.
- If >= 0.02% of tags can't be parsed, an error is logged.
- TSV line errors are always logged as errors.
- I/O errors now fail instead of being logged.
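A minimal sketch of the shape this takes with `thiserror` (variant names are illustrative):

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ParseError {
    #[error("unrecognized wikipedia tag: {0}")]
    Wikipedia(String),
    #[error("bad wikidata QID: {0}")]
    Qid(String),
    // I/O errors are propagated to the caller instead of captured.
    #[error(transparent)]
    Io(#[from] std::io::Error),
}
```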
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use the rayon and osmpbf crates; output an intermediate TSV file in the
same format as osmconvert, for use with the new `--osm-tags` flag (see the
sketch after this list).
- The number of threads spawned can be configured with the `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- `run.sh` now expects a PBF file to extract tags from.
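A minimal sketch of the parallel read with `osmpbf` (counting tagged ways as a stand-in for writing the TSV; wiring `--procs` into rayon's global pool is an assumption):

```rust
use osmpbf::{Element, ElementReader};

fn count_wikidata_ways(pbf_path: &str, procs: usize) -> anyhow::Result<u64> {
    // Bound the worker threads, as the `--procs` flag might.
    rayon::ThreadPoolBuilder::new()
        .num_threads(procs)
        .build_global()?;

    let reader = ElementReader::from_path(pbf_path)?;
    // Map each element on worker threads, then reduce the partial sums.
    let count = reader.par_map_reduce(
        |element| match element {
            Element::Way(way) if way.tags().any(|(k, _)| k == "wikidata") => 1u64,
            _ => 0,
        },
        || 0u64,
        |a, b| a + b,
    )?;
    Ok(count)
}
```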
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use CLI subcommands (e.g. `om-wikiparser get-articles`; sketched below)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand
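A minimal sketch of the CLI shape with `clap`'s derive API (assuming clap v4; fields are elided):

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "om-wikiparser")]
struct Args {
    #[command(subcommand)]
    cmd: Cmd,
}

#[derive(Subcommand)]
enum Cmd {
    /// Invoked as `om-wikiparser get-articles`.
    GetArticles,
    /// The former standalone simplify binary.
    Simplify,
}

fn main() {
    match Args::parse().cmd {
        Cmd::GetArticles => { /* ... */ }
        Cmd::Simplify => { /* ... */ }
    }
}
```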
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Parse wikipedia and wikidata tags from a TSV file of OSM tags,
compatible with the `--csv` output of `osmconvert`.
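A minimal sketch of the per-line parsing, assuming tab-separated `@id`, `wikidata`, and `wikipedia` columns (the real column set is whatever `osmconvert` was asked for):

```rust
/// Returns None if the line is malformed or the id doesn't parse.
fn parse_tag_line(line: &str) -> Option<(u64, &str, &str)> {
    let mut fields = line.split('\t');
    let id = fields.next()?.parse().ok()?;
    let wikidata = fields.next()?;
    let wikipedia = fields.next()?;
    Some((id, wikidata, wikipedia))
}
```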
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
This allows us to extract articles that we know the title of, but not the QID of, from other languages' dumps in another pass.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Per-language section removal is configured with a static JSON file.
This includes a test to make sure the file exists and is formatted correctly.
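A sketch of what that test can look like with `serde_json` (the file name and the language-to-sections shape are assumptions):

```rust
use std::collections::{BTreeMap, BTreeSet};

#[test]
fn config_parses() {
    // Compiled in, so the test also fails if the file goes missing.
    let raw = include_str!("../article_processing_config.json");
    let config: BTreeMap<String, BTreeSet<String>> =
        serde_json::from_str(raw).expect("config is valid JSON in the expected shape");
    assert!(!config.is_empty());
}
```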
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
By far the largest amount of time was wasted because I used `ok_or`
instead of `ok_or_else` to convert to the anyhow errors. I thought that
with a static string and no arguments they weren't that expensive, but
they still took a lot of time, even with backtraces disabled.
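For reference, the difference is just eager versus lazy construction of the error (`tag` is an illustrative `Option`):

```rust
use anyhow::anyhow;

// Eager: the anyhow::Error is built on every call, even when `tag` is Some.
let value = tag.ok_or(anyhow!("no wikidata tag"))?;

// Lazy: the error is only constructed on the None path.
let value = tag.ok_or_else(|| anyhow!("no wikidata tag"))?;
```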
Removing those eager conversions reduces the benchmark time 60x:
running 4 tests
test hash_wikidata   ... bench:  14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:  36 ns/iter (+/- 11)
test parse_wikidata  ... bench:  18 ns/iter (+/- 0)
test parse_wikipedia ... bench: 835 ns/iter (+/- 68)
I also tried removing `Url::parse` and using string operations. That got
it down another 10x, but I don't think that's worth the potential bugs
right now.
A small optimization for reading the JSON: `std::io::stdin()` is behind
a lock by default and already has an internal buffer.
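A sketch of the pattern:

```rust
use std::io::{self, BufRead};

fn read_articles() -> io::Result<()> {
    // Stdin already buffers internally; locking once up front avoids
    // re-acquiring the mutex on every read.
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        // ... deserialize the JSON record from `line` ...
    }
    Ok(())
}
```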
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Initial results:
running 4 tests
test hash_wikidata   ... bench:     14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:     34 ns/iter (+/- 1)
test parse_wikidata  ... bench:     18 ns/iter (+/- 0)
test parse_wikipedia ... bench: 60,682 ns/iter (+/- 83,376)
Based on these results and a flamegraph of loading the file, URL parsing
is the most expensive part.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The HTML processing should perform both of the main steps handled by the
original `descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text
Determining how similar the output is will require more testing.
A separate binary target is included for standalone HTML processing.
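A sketch of the section-removal pass, assuming the `scraper` crate (the selector and matching are simplified; the empty-element pass would follow the same collect-then-detach pattern):

```rust
use scraper::{Html, Selector};

/// Detach any element whose header text matches a configured section title.
fn remove_sections(document: &mut Html, titles: &[&str]) {
    let headers = Selector::parse("h2").expect("static selector is valid");
    let ids: Vec<_> = document
        .select(&headers)
        .filter(|header| {
            let text: String = header.text().collect();
            titles.contains(&text.trim())
        })
        // Remove the enclosing parent section, not just the header itself.
        .filter_map(|header| header.parent().map(|section| section.id()))
        .collect();
    for id in ids {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}
```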
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>