Commit graph

16 commits

Evan Lloyd New-Schmidt
0fc43767aa Add script
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
d6e892343b Keep charset tags
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
ac556bd3d4 Save and log build commit
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
aa213fbece Make new qid writes atomic
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
75f4f6a21b Add option to dump new QIDs (#20)
This allows us to extract articles whose title we know but whose QID we don't from other languages' dumps in another pass.
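
For illustration, the dump-and-reuse flow might look roughly like this (the file layout, function names, and one-QID-per-line format are assumptions, not taken from the code):

    use std::collections::HashSet;
    use std::fs;
    use std::io::Write;

    /// First pass: append newly discovered QIDs to a dump file
    /// (hypothetical path), one per line.
    fn dump_new_qids(path: &str, qids: &[String]) -> std::io::Result<()> {
        let mut file = fs::OpenOptions::new().create(true).append(true).open(path)?;
        for qid in qids {
            writeln!(file, "{qid}")?;
        }
        Ok(())
    }

    /// Second pass: load the dumped QIDs as a filter for other languages' dumps.
    fn load_qid_filter(path: &str) -> std::io::Result<HashSet<String>> {
        Ok(fs::read_to_string(path)?.lines().map(str::to_owned).collect())
    }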

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-13 14:04:52 -04:00
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
9036e3413f Write to generator-compatible folder structure (#6)
The map generator expects the folder structure created by the current
scraper in order to add the article content into the mwm files.

- Article html is written to the wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to a directory named after the
  article title.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
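
A minimal sketch of the write-and-symlink step described above (paths, the helper name, and the error handling are illustrative only, not the repository's actual code):

    use std::fs;
    use std::os::unix::fs::symlink;
    use std::path::Path;

    /// Write one language's html under wikidata/<QID>/ and, if the article
    /// matched a title, create the title directory as a symlink to it.
    fn write_article(
        base: &Path,
        qid: &str,
        lang: &str,
        html: &str,
        matched_title: Option<&Path>,
    ) -> std::io::Result<()> {
        let qid_dir = base.join("wikidata").join(qid);
        fs::create_dir_all(&qid_dir)?;
        fs::write(qid_dir.join(format!("{lang}.html")), html)?;

        if let Some(title_dir) = matched_title {
            // Titles containing `/` just become nested directories.
            if let Some(parent) = title_dir.parent() {
                fs::create_dir_all(parent)?;
            }
            if !title_dir.exists() {
                symlink(&qid_dir, title_dir)?;
            }
        }
        Ok(())
    }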

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:34:20 -04:00
Evan Lloyd New-Schmidt
bb1f897cd2 Add checks for whitespace/empty strings in ids and titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
0a0317538c Rewrite comments as sentences for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
8435682ddf Add support for multiple languages
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.
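
For illustration only, loading and testing such a config might look roughly like this (the file name and schema are assumed, not taken from the repository):

    use std::collections::{BTreeMap, BTreeSet};

    /// Language code -> section headings to strip from articles in that language.
    type SectionsToRemove = BTreeMap<String, BTreeSet<String>>;

    fn parse_config(json: &str) -> Result<SectionsToRemove, serde_json::Error> {
        serde_json::from_str(json)
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn config_is_present_and_valid() {
            // include_str! fails the build if the file is missing; the test
            // fails if it is not valid JSON of the expected shape.
            // The file name here is a placeholder, not the real one.
            let config = parse_config(include_str!("../sections_to_remove.json")).unwrap();
            assert!(!config.is_empty());
        }
    }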

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
35faadc693 Optimize wikipedia title parsing
By far the most time was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought that with a
static string and no arguments they wouldn't be that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)
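
The difference, roughly (a sketch with an invented accessor, not the project's code): `ok_or` builds its error argument on every call, while `ok_or_else` only runs the closure on the error path.

    use anyhow::{anyhow, Result};

    fn host(url: &url::Url) -> Result<&str> {
        // Eager: the anyhow::Error is constructed on every call, even when
        // the Option is Some.
        // url.host_str().ok_or(anyhow!("url has no host"))

        // Lazy: the closure (and the error) only runs when the value is None.
        url.host_str().ok_or_else(|| anyhow!("url has no host"))
    }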

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth potential bugs
right now.

A small optimization for reading the json: `std::io::stdin()` is behind a
lock by default, and already has an internal buffer.
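
A sketch of what reading through the already-buffered lock can look like (simplified; this is not the actual input handling):

    use std::io::BufRead;

    fn read_input() -> std::io::Result<()> {
        // Locking once avoids re-acquiring the mutex on every read, and the
        // returned StdinLock already implements BufRead, so no extra
        // BufReader is needed.
        let stdin = std::io::stdin().lock();
        for line in stdin.lines() {
            let _line = line?;
            // ... parse the json line here ...
        }
        Ok(())
    }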

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
f12e8d802c Add id parsing benchmarks
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.
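
These numbers appear to come from libtest's nightly `#[bench]` harness; the general shape is something like the following (the input string is a stand-in, and routing `parse_wikipedia` through `url::Url::parse` is an assumption based on the later commit's mention of `url::parse`):

    #![feature(test)] // nightly-only benchmark harness
    extern crate test;

    use test::{black_box, Bencher};

    #[bench]
    fn parse_wikipedia(b: &mut Bencher) {
        // Stand-in input; the real benchmark data is not shown in this log.
        let input = "https://en.wikipedia.org/wiki/Article_Title";
        b.iter(|| black_box(url::Url::parse(black_box(input)).unwrap()));
    }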

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
d55d3cc7e0 Initial parsing and processing
The html processing should perform both of the main steps handled by the
original `descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.
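
A rough sketch of those two steps using the `scraper` crate (the selectors and the section matching here are simplified placeholders, not the actual implementation):

    use scraper::{Html, Selector};

    /// Detach every element matched by `selector` from the parsed document.
    fn remove_matching(document: &mut Html, selector: &Selector) {
        let ids: Vec<_> = document.select(selector).map(|el| el.id()).collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }

    /// Detach elements that contain no non-whitespace text.
    fn remove_empty(document: &mut Html) {
        // Placeholder selector; the real set of candidate elements differs.
        let selector = Selector::parse("p, div, span").unwrap();
        let ids: Vec<_> = document
            .select(&selector)
            .filter(|el| el.text().all(|t| t.trim().is_empty()))
            .map(|el| el.id())
            .collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }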

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
aba31775fa Setup GitHub (#2)
* Fix license identifier
* Add CI tests

A cached setup that includes cargo check, clippy, fmt, and test

* Fix formatting
* Remove explicit rustup install

cargo, etc. are already installed; see:
<https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md>

* Add more context to cache prefix-key
* Apply suggestions from code review
* Ignore non-rust files
* Remove unused matrix testing key
* Use a better filename

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-06-01 09:25:35 +02:00
Evan Lloyd New-Schmidt
ddf6028465 Initial rust setup (#1)
* Initial rust setup

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

* Update README.md

Co-authored-by: Evan Lloyd New-Schmidt <newsch@users.noreply.github.com>

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-05-30 19:00:05 +02:00
Alexander Borsuk
f72e380d11 Initial commit
2023-05-30 16:01:35 +02:00