This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case where non-normalized output is inadvertently created
as tags are joined, but I don't think that's worth it yet.
From <https://mediawiki.org/wiki/Unicode_normalization_considerations>:
> MediaWiki applies normalization form C (NFC) to Unicode text input.
> MediaWiki doesn't apply any normalization to its output, for example
> `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row,
> without precomposed characters like U+00E9 appearing).
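The mismatch described in the quote can be reproduced with Rust's standard library alone, since `str` equality compares code points without normalizing. This is a minimal sketch: `code_points` and `titles_match_exactly` are hypothetical helpers, not functions from this repository, and actual NFC normalization would need an external crate that is not shown here.

```rust
/// Formats each code point of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical check: compares a config section title with an article
/// heading exactly as stored. No normalization is applied, so a
/// decomposed heading will not match a precomposed config entry.
fn titles_match_exactly(config_title: &str, heading: &str) -> bool {
    config_title == heading
}
```

Comparing `"caf\u{e9}"` with `"cafe\u{301}"` through `titles_match_exactly` returns `false` even though both render as "café", which is why the config sections need to be in the same form as the input.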
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
* Exclude more Korean sections
- "각주 및 참고 문헌": notes and references
- "같이 읽기": read also
- "관련 항목": related items
- "관련 홈페이지": related web sites
- "더 보기": see more
- "더 읽어보기": read more
- "둘러보기": look around
- "외부 링크 및 참고 자료": external links and references
- "외부 영상": external videos
- "외부링크": external links
- "주해": notes
- "참고 문헌 및 링크": references and links
- "참고 문헌": references
- "참고 서적": references
- "참고": notes
- "참고자료": referenced data
- "참조 문헌": references
- "참조 자료": referenced data
- "참조 항목": referenced items
- "참조": references
* Undo the exclusion of "둘러보기", which is used for 'Nearby places'
I've tried a number of ways to do this, and this has been the simplest
and most reliable.
- Catches jobs that exit before calling the function.
- Doesn't interfere with the kill_jobs hook or leave orphan processes.
- Bubbles up the exit code.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.
OSM ids are not shared across nodes, ways, and relations, so the object
type should be saved as well. Including the edit version will make it
easier to see if a mis-tagged object is outdated.
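As a rough illustration of why the object type has to travel with the id, the sketch below keys entries by a `(type, id)` pair. The names `OsmType`, `TagEntry`, and `index_entries` are assumptions made for this example and are not taken from the codebase.

```rust
use std::collections::HashMap;

/// The three OSM element kinds; each has its own id space.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OsmType {
    Node,
    Way,
    Relation,
}

/// Hypothetical metadata attached to one tagged object.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TagEntry {
    version: u32,
    wikipedia: String,
}

/// Indexes rows by (type, id) so that node 1 and way 1 stay distinct.
fn index_entries(rows: Vec<(OsmType, u64, TagEntry)>) -> HashMap<(OsmType, u64), TagEntry> {
    rows.into_iter().map(|(t, id, e)| ((t, id), e)).collect()
}
```

Keying on the id alone would silently overwrite one of two objects that share a number but differ in type.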
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Write a TSV file with the line number, error, and input text.
- Include the OSM object id if it is available in the tag file.
- Update run script to write file once before extracting.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing html to an `errors/` subdirectory
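The first two points can be sketched with `std::panic::catch_unwind`. The helper names `panic_message` and `catch_to_error` are hypothetical, and this is only an outline of the general technique, not the code from this change. A panic payload is normally a `&'static str` or a `String` depending on how the message was built, so the sketch tries both.

```rust
use std::any::Any;
use std::panic::{self, UnwindSafe};

/// Extracts a readable message from a panic payload.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

/// Runs `f`, turning a panic into an `Err` so the caller can log it
/// and continue. The default panic hook still runs first, so the
/// panic message (and backtrace, if enabled) is still printed.
fn catch_to_error<T>(f: impl FnOnce() -> T + UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(f).map_err(|payload| panic_message(&*payload))
}
```

A caller could match on the `Err` value, log it, and write the offending input to a file before moving on to the next article.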
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The caught HTML panics still print backtraces. Disabling them in Rust
would require swapping the global panic hook when entering and
exiting the function.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove head in initial stage
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
- All parsing errors are still logged to `debug` level.
- If >= 0.02% of tags can't be parsed, an error is logged.
- TSV line errors are always logged as errors.
- I/O errors now cause a failure instead of being logged.
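The threshold rule above could be checked roughly as follows. `exceeds_threshold` is a hypothetical name, and integer arithmetic is used here only to avoid floating-point comparisons; the actual implementation may differ.

```rust
/// Returns true when at least 0.02% (2 in 10,000) of tags failed to parse.
/// With no tags at all there is nothing to report.
fn exceeds_threshold(failed: u64, total: u64) -> bool {
    if total == 0 {
        return false;
    }
    // failed / total >= 2 / 10_000, rearranged to avoid division.
    failed.saturating_mul(10_000) >= total.saturating_mul(2)
}
```

With 10,000 tags, one failure stays below the threshold and two failures reach it.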
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use the rayon and osmpbf crates. Output an intermediate TSV file in the
same format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>