Commit graph

76 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
4b776f49d4 Add denylist from Extracts API
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
75fa04407d Add snapshot tests for html output
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
32cd084f3f Add simplification logging
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c9eb7a160a Add option to not simplify when extracting
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
941d2b1032 Structure parse errors and only log warning if above threshold
- Add custom error types with `thiserror` crate in preparation for .
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors now cause a failure instead of being logged.
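
The threshold rule above can be sketched as follows. This is a minimal illustration, not the project's code: the `ParseError` variants, `parse_qid`, and `exceeds_threshold` are hypothetical names, and `Display` is implemented by hand so the sketch needs only the standard library (the commit itself uses the `thiserror` crate for its error types).

```rust
use std::fmt;

/// Hypothetical structured error for a tag that could not be parsed.
#[derive(Debug)]
pub enum ParseError {
    Empty,
    InvalidQid(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::Empty => write!(f, "empty value"),
            ParseError::InvalidQid(s) => write!(f, "invalid QID: {s:?}"),
        }
    }
}

impl std::error::Error for ParseError {}

/// Parse a Wikidata id such as `Q59320` into its numeric part (assumed format).
pub fn parse_qid(s: &str) -> Result<u64, ParseError> {
    let s = s.trim();
    if s.is_empty() {
        return Err(ParseError::Empty);
    }
    let digits = s
        .strip_prefix('Q')
        .ok_or_else(|| ParseError::InvalidQid(s.to_string()))?;
    digits
        .parse()
        .map_err(|_| ParseError::InvalidQid(s.to_string()))
}

/// True when at least 0.02% of tags failed to parse.
/// Integer arithmetic avoids floating-point rounding: failed/total >= 2/10_000.
pub fn exceeds_threshold(failed: u64, total: u64) -> bool {
    total > 0 && failed.saturating_mul(10_000) >= total.saturating_mul(2)
}
```

A caller would collect the `Err` values while parsing, log each one at `debug` level, and log a single error at the end only if `exceeds_threshold` returns true.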

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
34bb9318d5 Refactor and rename title/qid wrappers
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
bdf6f1a68c Improve url handling
- Check for urls in osm tags
- Handle mobile urls
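
As a rough sketch of what these two checks could look like, assuming mobile Wikipedia hosts have the form `<lang>.m.wikipedia.org`: the function names `looks_like_url` and `normalize_host` are hypothetical and are not taken from the repository.

```rust
/// Heuristic check for a tag value that holds a full URL
/// instead of the usual `lang:Title` form.
pub fn looks_like_url(value: &str) -> bool {
    let v = value.trim();
    v.starts_with("http://") || v.starts_with("https://")
}

/// Rewrite a mobile host such as `en.m.wikipedia.org` to `en.wikipedia.org`.
/// Any other host is returned unchanged.
pub fn normalize_host(host: &str) -> String {
    match host.strip_suffix(".m.wikipedia.org") {
        Some(lang) if !lang.is_empty() => format!("{lang}.wikipedia.org"),
        _ => host.to_string(),
    }
}
```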

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
6d242a62aa Extract tags in parallel in rust
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
b6db70f74c Refactor into subcommands
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand
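
A standard-library-only sketch of subcommand dispatch is shown here for illustration. The commit does not say which argument parser is used, so this hand-rolled `parse_command` and the `Command` enum are assumptions; only the `get-articles` name comes from the message above, and `simplify` is a guess at the name of the converted helper.

```rust
/// Hypothetical set of subcommands.
#[derive(Debug, PartialEq)]
pub enum Command {
    GetArticles,
    Simplify,
}

/// Map the first argument after the program name to a subcommand.
pub fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args.first().copied() {
        Some("get-articles") => Ok(Command::GetArticles),
        Some("simplify") => Ok(Command::Simplify),
        Some(other) => Err(format!("unknown subcommand: {other}")),
        None => Err("missing subcommand".to_string()),
    }
}
```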

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
5df2d8d243 Add new option to parse osm tag file
Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the "--csv" output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
0fc43767aa Add script
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
d6e892343b Keep charset tags
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
ac556bd3d4 Save and log build commit
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
aa213fbece Make new qid writes atomic
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
75f4f6a21b Add option to dump new QIDs ()
This allows us to extract articles that we know the title of but not the QID of from other languages' dumps in another pass.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-13 14:04:52 -04:00
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See  for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
9036e3413f Write to generator-compatible folder structure ()
The map generator expects a certain folder structure created by the
current scraper to add the article content into the mwm files.

- Article html is written to wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to article title directory.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
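
The layout above could be produced with path helpers along these lines. This is a sketch under assumptions: `wikidata_file` and `title_dir` are hypothetical names, and the symlinking step is left out.

```rust
use std::path::{Path, PathBuf};

/// Path of an article's HTML file in the wikidata directory,
/// e.g. `<base>/wikidata/Q59320/en.html`.
pub fn wikidata_file(base: &Path, qid: u64, lang: &str) -> PathBuf {
    base.join("wikidata")
        .join(format!("Q{qid}"))
        .join(format!("{lang}.html"))
}

/// Directory for a matched title, e.g. `<base>/en.wikipedia.org/wiki/<title>`.
/// A `/` in the title is not escaped, so it creates nested subdirectories.
/// Note: `Path::join` discards `base` if `title` is an absolute path
/// (starts with `/`), so a real implementation would need to guard against that.
pub fn title_dir(base: &Path, lang: &str, title: &str) -> PathBuf {
    base.join(format!("{lang}.wikipedia.org"))
        .join("wiki")
        .join(title)
}
```

With a base of `out`, the title `Baltimore/Washington_International_Airport` maps to two nested directories under `out/en.wikipedia.org/wiki`, matching the tree shown above.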

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:34:20 -04:00
Evan Lloyd New-Schmidt
bb1f897cd2 Add checks for whitespace/empty strings in ids and titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
0a0317538c Rewrite comments as sentences for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
8435682ddf Add support for multiple languages
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
35faadc693 Optimize wikipedia title parsing
The largest time by far was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought with a
static string and no arguments they weren't that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)
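
The difference between the two methods can be demonstrated with a small standard-library sketch. `make_error` is a hypothetical stand-in for constructing an expensive error value (the commit used `anyhow` errors), and the counter only exists to make the extra work visible.

```rust
use std::cell::Cell;

/// Stand-in for an expensive error constructor; counts how often it runs.
fn make_error(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    String::from("missing value")
}

/// Eager: the argument to `ok_or` is evaluated on every call,
/// even when `value` is `Some` and the error is thrown away.
pub fn eager(value: Option<u32>, calls: &Cell<u32>) -> Result<u32, String> {
    value.ok_or(make_error(calls))
}

/// Lazy: the closure passed to `ok_or_else` runs only when `value` is `None`.
pub fn lazy(value: Option<u32>, calls: &Cell<u32>) -> Result<u32, String> {
    value.ok_or_else(|| make_error(calls))
}
```

On the success path, `eager` still pays for building the error, while `lazy` does not. In a hot parsing loop where almost every call succeeds, that cost is paid once per item for nothing.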

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth potential bugs
right now.

A small optimization for reading the JSON: `std::io::stdin()` is behind a
lock by default, and already has an internal buffer.
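
One way to take advantage of this, sketched with hypothetical function names: lock stdin once and read lines through the locked handle, which implements `BufRead`, instead of re-locking per read or wrapping it in a second buffer.

```rust
use std::io::{self, BufRead};

/// Count the lines available from any buffered reader.
pub fn count_lines<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        line?; // propagate I/O errors instead of ignoring them
        count += 1;
    }
    Ok(count)
}

/// Lock stdin once for the whole read; `StdinLock` reuses stdin's
/// internal buffer, so no extra `BufReader` is needed.
pub fn count_stdin_lines() -> io::Result<usize> {
    count_lines(io::stdin().lock())
}
```

Keeping the reading logic generic over `BufRead` also makes it easy to exercise with an in-memory byte slice in tests.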

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
f12e8d802c Add id parsing benchmarks
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
d55d3cc7e0 Initial parsing and processing
The html processing should perform both of the main steps handled by the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
aba31775fa Setup GitHub ()
* Fix license identifier
* Add CI tests

A cached setup that includes cargo check, clippy, fmt, and test

* Fix formatting
* Remove explicit rustup install

cargo, etc. is already installed, see:
<https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md>

* Add more context to cache prefix-key
* Apply suggestions from code review
* Ignore non-rust files
* Remove unused matrix testing key
* Use a better filename

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-06-01 09:25:35 +02:00
Evan Lloyd New-Schmidt
ddf6028465 Initial rust setup ()
* Initial rust setup

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

* Update README.md

Co-authored-by: Evan Lloyd New-Schmidt <newsch@users.noreply.github.com>

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-05-30 19:00:05 +02:00
Alexander Borsuk
f72e380d11 Initial commit
2023-05-30 16:01:35 +02:00