Commit graph

15 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
d723452ec5 Use tracing-logfmt
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-28 14:20:52 -04:00
Evan Lloyd New-Schmidt
cd03fed762 Add check for unicode normalization in config (#44)
This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case where non-normalized output is inadvertently created
as tags are joined, but I don't think that's worth it yet.

From <https://mediawiki.org/wiki/Unicode_normalization_considerations>:

> MediaWiki applies normalization form C (NFC) to Unicode text input.

> MediaWiki doesn't apply any normalization to its output, for example
> `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row,
> without precomposed characters like U+00E9 appearing).

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-24 10:34:11 -04:00
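
As a rough illustration of why the check in cd03fed762 matters, the sketch below compares the two spellings of "café" from the quoted MediaWiki example. The constants and the `section_matches` helper are hypothetical and are not taken from the repository. The sketch uses only the standard library, so it shows the mismatch rather than performing NFC normalization itself.

```rust
/// "café" with the precomposed U+00E9, as NFC-normalized text would contain it.
const PRECOMPOSED: &str = "caf\u{00E9}";
/// "café" written as U+0065 followed by the combining U+0301 (not NFC).
const DECOMPOSED: &str = "cafe\u{0301}";

/// Hypothetical helper: compares a configured section title with an article
/// section title exactly, without applying any normalization.
fn section_matches(config_title: &str, article_title: &str) -> bool {
    config_title == article_title
}

/// Number of Unicode scalar values in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

Both constants render identically, yet `section_matches(PRECOMPOSED, DECOMPOSED)` is false because the underlying code point sequences differ. A config entry that is not in NFC could therefore silently fail to match a section title.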
Jonathan Davies
a4caf23e35 Cargo.toml: Enabled LTO for release build.
2024-03-15 01:59:48 +01:00
Evan Lloyd New-Schmidt
580b60bdd4 Use tracing with pid, lang, title, url, line, and byte fields
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
218e55931f Remove debug info from release builds
A special 'bench' profile can be used instead.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
c3162c9fed Reduce unnecessary rebuilds
- Only embed commit on release builds.
- Add CI and scripts to excluded cargo files.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
75fa04407d Add snapshot tests for html output
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
941d2b1032 Structure parse errors and only log warning if above threshold
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors now cause a failure instead of being logged.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
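
The threshold rule described in 941d2b1032 can be sketched as follows. The function name and the integer formulation are assumptions for illustration only, and the actual implementation may compute the ratio differently.

```rust
/// Returns true when at least 0.02% of tags failed to parse.
///
/// 0.02% is 2 in 10,000, so the comparison is done in integer arithmetic
/// to avoid floating-point rounding at the boundary.
fn exceeds_error_threshold(failed: u64, total: u64) -> bool {
    if total == 0 {
        // With no tags at all there is nothing to report.
        return false;
    }
    failed.saturating_mul(10_000) >= total.saturating_mul(2)
}
```

Under this sketch, 2 failures out of 10,000 tags reach the threshold, while 1 failure out of 10,000 does not.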
Evan Lloyd New-Schmidt
6d242a62aa Extract tags in parallel in rust
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
5df2d8d243 Add new option to parse osm tag file
Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the "--csv" output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
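
A minimal sketch of the kind of parsing 5df2d8d243 describes is shown below. It assumes a hypothetical two-column layout (wikidata value, then wikipedia value, separated by a tab). The real column order depends on how `osmconvert` was invoked, and the function names are illustrative rather than the repository's own.

```rust
/// Parse a Wikidata QID such as "Q42" into its numeric part.
fn parse_qid(value: &str) -> Option<u32> {
    let digits = value.trim().strip_prefix('Q')?;
    // Reject empty input and anything that is not plain ASCII digits
    // (integer parsing alone would accept a leading '+').
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Split a wikipedia tag such as "en:Douglas Adams" into (language, title).
fn parse_wikipedia_tag(value: &str) -> Option<(&str, &str)> {
    let (lang, title) = value.trim().split_once(':')?;
    let title = title.trim();
    if lang.is_empty() || title.is_empty() {
        return None;
    }
    Some((lang, title))
}

/// Parse one TSV line under the assumed layout: wikidata<TAB>wikipedia.
fn parse_line(line: &str) -> (Option<u32>, Option<(&str, &str)>) {
    let mut cols = line.split('\t');
    let qid = cols.next().and_then(parse_qid);
    let article = cols.next().and_then(parse_wikipedia_tag);
    (qid, article)
}
```

Either column may be empty or malformed, in which case the corresponding result is `None` and the other column is still parsed.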
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
8435682ddf Add support for multiple languages
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
d55d3cc7e0 Initial parsing and processing
The HTML processing should perform both of the main steps handled by the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
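
The second step listed in d55d3cc7e0, removing elements with no non-whitespace text, can be sketched on a toy document tree. The `Node` type and both functions are hypothetical stand-ins, since the repository works on parsed HTML rather than this simplified structure.

```rust
/// Toy stand-in for a parsed HTML tree.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// True if the node contains any non-whitespace text, directly or nested.
fn has_text(node: &Node) -> bool {
    match node {
        Node::Text(text) => !text.trim().is_empty(),
        Node::Element { children, .. } => children.iter().any(has_text),
    }
}

/// Recursively drop elements that contain no non-whitespace text.
/// Text nodes are always kept, including whitespace-only ones.
fn remove_empty_elements(nodes: Vec<Node>) -> Vec<Node> {
    nodes
        .into_iter()
        .filter(|node| matches!(node, Node::Text(_)) || has_text(node))
        .map(|node| match node {
            Node::Element { tag, children } => Node::Element {
                tag,
                children: remove_empty_elements(children),
            },
            text => text,
        })
        .collect()
}
```

In this sketch, a `div` whose only descendant is a whitespace-only `span` is removed entirely, while a `p` containing "Hello" is kept.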
Evan Lloyd New-Schmidt
aba31775fa Set up GitHub (#2)
* Fix license identifier
* Add CI tests

A cached setup that includes cargo check, clippy, fmt, and test

* Fix formatting
* Remove explicit rustup install

cargo, etc. are already installed, see:
<https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md>

* Add more context to cache prefix-key
* Apply suggestions from code review
* Ignore non-rust files
* Remove unused matrix testing key
* Use a better filename

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-06-01 09:25:35 +02:00
Evan Lloyd New-Schmidt
ddf6028465 Initial rust setup (#1)
* Initial rust setup

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

* Update README.md

Co-authored-by: Evan Lloyd New-Schmidt <newsch@users.noreply.github.com>

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-05-30 19:00:05 +02:00