This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case where non-normalized output is inadvertently created
as tags are joined, but I don't think that's worth it yet.
From <https://mediawiki.org/wiki/Unicode_normalization_considerations>:
> MediaWiki applies normalization form C (NFC) to Unicode text input.
> MediaWiki doesn't apply any normalization to its output, for example
> `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row,
> without precomposed characters like U+00E9 appearing).
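A rough sketch of keeping the configured section titles in NFC, assuming the `unicode-normalization` crate (the helper name here is illustrative, not the exact code in this commit):

```rust
use unicode_normalization::UnicodeNormalization;

/// Normalize configured section titles to NFC so they compare equal to
/// Wikipedia's NFC-normalized input.
fn normalize_titles(titles: &[String]) -> Vec<String> {
    titles.iter().map(|t| t.nfc().collect()).collect()
}

fn main() {
    // "café" written with a combining accent (U+0065 U+0301)…
    let decomposed = "cafe\u{0301}";
    // …compares equal to the precomposed form (U+00E9) once normalized.
    assert_eq!(decomposed.nfc().collect::<String>(), "caf\u{00E9}");
    println!("{:?}", normalize_titles(&["Références".to_owned()]));
}
```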
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing
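A minimal shape for such a benchmark, assuming the `criterion` crate; the `wikiparser::process_html` entry point and fixture path are hypothetical, not the actual names in this repository:

```rust
// benches/html.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn html_processing(c: &mut Criterion) {
    // Hypothetical fixture taken from the Wikipedia Enterprise dump.
    let article = std::fs::read_to_string("tests/data/article.html").unwrap();
    c.bench_function("process_html", |b| {
        b.iter(|| wikiparser::process_html(black_box(&article)))
    });
}

criterion_group!(benches, html_processing);
criterion_main!(benches);
```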
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add custom error types with the `thiserror` crate in preparation for #25
  (see the sketch after this list).
- Parsing errors are captured instead of being logged at `warn` level by default.
- All parsing errors are still logged at `debug` level.
- If >= 0.02% of tags can't be parsed, an error is logged.
- TSV line errors are always logged as errors.
- I/O errors cause a failure instead of being logged.
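A minimal sketch of how the error types and the 0.02% threshold could look with `thiserror`; the variant names, fields, and `log` call are assumptions, not the exact code in this commit:

```rust
use std::io;

use thiserror::Error;

/// Hypothetical error type; the real variants may differ.
#[derive(Debug, Error)]
pub enum Error {
    #[error("bad wikipedia tag: {0:?}")]
    Wikipedia(String),
    #[error("bad wikidata QID: {0:?}")]
    Qid(String),
    #[error("malformed TSV line: {0}")]
    TsvLine(String),
    // I/O errors are propagated with `?` and fail the run instead of being logged.
    #[error(transparent)]
    Io(#[from] io::Error),
}

/// Log an error if at least 0.02% of tags could not be parsed.
fn check_parse_rate(errors: usize, total: usize) {
    if total > 0 && errors as f64 / total as f64 >= 0.0002 {
        log::error!("{errors} of {total} tags could not be parsed");
    }
}
```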
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use the `rayon` and `osmpbf` crates; output an intermediate TSV file in the
  same format as `osmconvert`, for use with the new `--osm-tags` flag (a rough
  sketch follows this list).
- The number of threads spawned can be configured with the `--procs` flag.
- Replace all Wikidata ID references with QID.
- Update the script and documentation to use the new subcommands.
- `run.sh` now expects a PBF file to extract tags from.
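A rough sketch of the extraction step, assuming `osmpbf`'s parallel element reader (which decodes blocks on rayon's thread pool); the TSV column layout and function names are assumptions:

```rust
use osmpbf::{Element, ElementReader};

/// Turn one element's tags into an optional TSV line; the column order
/// (id, wikipedia, wikidata) is illustrative, not osmconvert's exact layout.
fn to_tsv_line<'a>(id: i64, tags: impl Iterator<Item = (&'a str, &'a str)>) -> Option<String> {
    let (mut wikipedia, mut wikidata) = (None, None);
    for (k, v) in tags {
        match k {
            "wikipedia" => wikipedia = Some(v),
            "wikidata" => wikidata = Some(v),
            _ => {}
        }
    }
    if wikipedia.is_none() && wikidata.is_none() {
        return None;
    }
    Some(format!("{id}\t{}\t{}", wikipedia.unwrap_or(""), wikidata.unwrap_or("")))
}

/// Collect wikipedia/wikidata tags from a .osm.pbf file in parallel.
fn extract_tags(path: &str) -> Result<Vec<String>, osmpbf::Error> {
    ElementReader::from_path(path)?.par_map_reduce(
        |element| {
            let line = match &element {
                Element::Node(n) => to_tsv_line(n.id(), n.tags()),
                Element::DenseNode(n) => to_tsv_line(n.id(), n.tags()),
                Element::Way(w) => to_tsv_line(w.id(), w.tags()),
                Element::Relation(r) => to_tsv_line(r.id(), r.tags()),
            };
            line.into_iter().collect::<Vec<_>>()
        },
        Vec::new,
        |mut a, mut b| {
            a.append(&mut b);
            a
        },
    )
}
```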
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Parse `wikipedia` and `wikidata` tags from a TSV file of OSM tags,
compatible with the `--csv` output of `osmconvert`.
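A minimal sketch of reading one such line; the column order (id, wikipedia, wikidata) is an assumption for illustration rather than the exact `osmconvert` column set:

```rust
/// Split one tab-separated line into its columns.
fn parse_line(line: &str) -> Option<(&str, &str, &str)> {
    let mut cols = line.split('\t');
    Some((cols.next()?, cols.next()?, cols.next()?))
}

fn main() {
    let line = "123456\tde:Berlin\tQ64";
    assert_eq!(parse_line(line), Some(("123456", "de:Berlin", "Q64")));
}
```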
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Per-language section removal is configured with a static JSON file.
A test is included to make sure the file exists and is formatted correctly.
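A sketch of what the config shape and the test could look like, assuming `serde` and `serde_json`; the file path and field layout are illustrative:

```rust
use std::collections::BTreeMap;

use serde::Deserialize;

/// Hypothetical config shape: language code -> section titles to remove.
#[derive(Deserialize)]
struct SectionsToRemove(BTreeMap<String, Vec<String>>);

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn config_is_valid() {
        // include_str! also fails the build if the file is missing.
        let raw = include_str!("../article_processing_config.json");
        let config: SectionsToRemove = serde_json::from_str(raw).expect("valid JSON");
        assert!(!config.0.is_empty());
    }
}
```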
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The HTML processing should perform both of the main steps handled by the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text
Determining how similar the output is will require more testing.
A separate binary target is included for standalone HTML processing.
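A rough sketch of the whitespace-only element removal step, assuming the `scraper` crate; the selector list and function name are illustrative:

```rust
use scraper::{Html, Selector};

/// Detach elements whose text content is only whitespace.
fn remove_empty_elements(document: &mut Html) {
    // Selector::parse only fails on an invalid selector string.
    let selector = Selector::parse("p, div, section, ul, ol").unwrap();
    let empty: Vec<_> = document
        .select(&selector)
        .filter(|el| el.text().all(|t| t.trim().is_empty()))
        .map(|el| el.id())
        .collect();
    for id in empty {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}

fn main() {
    let mut doc = Html::parse_document("<html><body><p>  </p><p>text</p></body></html>");
    remove_empty_elements(&mut doc);
    assert!(!doc.root_element().html().contains("<p>  </p>"));
}
```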
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>