Commit graph

37 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
d0480e9089 Make thread pool control similar to osmium
- Allow setting number relative to number of cores
- Default to Cores - 2 threads
- Add env variable OM_POOL_THREADS (lower priority than CLI)
- Rename CLI option to `-t/--threads`

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-01 12:03:58 -04:00
Evan Lloyd New-Schmidt
c7fe34f3ad Remove header ids
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
b96c2cf4db Refactor simplification
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove head in initial stage

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
6c02f4a569 Remove coordinates from output
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c4028e52fa Preserve excerpts
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
3d3ecb52b2 Minify whitespace between elements
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81783695d5 Remove doctype and html element
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
58f32b43fd Remove empty sections after other removals
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
cc3ae9b629 Remove "(listen)" text
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81f528a350 Expand spans, sections, and body after removing head
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
0a0a94b484 Remove comments
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
4b776f49d4 Add denylist from Extracts API
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
75fa04407d Add snapshot tests for html output
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
32cd084f3f Add simplification logging
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c9eb7a160a Add option to not simplify when extracting
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
941d2b1032 Structure parse errors and only log warning if above threshold
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors will fail instead of be logged.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
34bb9318d5 Refactor and rename title/qid wrappers
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
bdf6f1a68c Improve url handling
- Check for urls in osm tags
- Handle mobile urls

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
6d242a62aa Extract tags in parallel in rust
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
b6db70f74c Refactor into subcommands
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
5df2d8d243 Add new option to parse osm tag file
Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the "--csv" output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
0fc43767aa Add script
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
d6e892343b Keep charset tags
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
ac556bd3d4 Save and log build commit
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
aa213fbece Make new qid writes atomic
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
75f4f6a21b
Add option to dump new QIDs (#20)
This allows us to extract articles that we know the title of but not the QID of from other language's dumps in a another pass.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-13 14:04:52 -04:00
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
9036e3413f
Write to generator-compatible folder structure (#6)
The map generator expects a certain folder structure created by the
current scraper to add the article content into the mwm files.

- Article html is written to wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to article title directory.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:34:20 -04:00
Evan Lloyd New-Schmidt
bb1f897cd2 Add checks for whitespace/empty strings in ids and titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
0a0317538c Rewrite comments as sentences for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
8435682ddf Add support for multiple languages
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
35faadc693 Optimize wikipedia title parsing
The largest time by far was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought with a
static string and no arguments they weren't that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth potential bugs
right now.

A small optimization for reading the json: sys::stdin() is behind a
lock by default, and already has an internal buffer.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
f12e8d802c Add id parsing benchmarks
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
d55d3cc7e0 Initial parsing and processing
The html processing should perform both of the main steps handled from the original
`descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
aba31775fa
Setup GitHub (#2)
* Fix license identifier
* Add CI tests

A cached setup that includes cargo check, clippy, fmt, and test

* Fix formatting
* Remove explicit rustup install

cargo, etc. is already installed, see:
<https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md>

* Add more context to cache prefix-key
* Apply suggestions from code review
* Ignore non-rust files
* Remove unused matrix testing key
* Use a better filename

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-06-01 09:25:35 +02:00
Evan Lloyd New-Schmidt
ddf6028465
Initial rust setup (#1)
* Initial rust setup

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

* Update README.md

Co-authored-by: Evan Lloyd New-Schmidt <newsch@users.noreply.github.com>

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-05-30 19:00:05 +02:00
Alexander Borsuk
f72e380d11
Initial commit 2023-05-30 16:01:35 +02:00