Commit graph

16 commits

Evan Lloyd New-Schmidt
0fc43767aa Add script
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
d6e892343b Keep charset tags
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
ac556bd3d4 Save and log build commit
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
aa213fbece Make new qid writes atomic
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
75f4f6a21b Add option to dump new QIDs (#20)
This allows us to extract articles whose title we know but whose QID we don't from other languages' dumps in another pass.
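
For illustration, the dump-and-reuse flow might look roughly like this (the file layout, function names, and one-QID-per-line format are assumptions, not taken from the code):

    use std::collections::HashSet;
    use std::fs;
    use std::io::Write;

    /// First pass: append newly discovered QIDs to a dump file
    /// (hypothetical path), one per line.
    fn dump_new_qids(path: &str, qids: &[String]) -> std::io::Result<()> {
        let mut file = fs::OpenOptions::new().create(true).append(true).open(path)?;
        for qid in qids {
            writeln!(file, "{qid}")?;
        }
        Ok(())
    }

    /// Second pass: load the dumped QIDs as a filter for other languages' dumps.
    fn load_qid_filter(path: &str) -> std::io::Result<HashSet<String>> {
        Ok(fs::read_to_string(path)?.lines().map(str::to_owned).collect())
    }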

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-13 14:04:52 -04:00
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
9036e3413f Write to generator-compatible folder structure (#6)
The map generator expects the folder structure created by the current
scraper in order to add the article content into the mwm files.

- Article html is written to the wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to a directory named after the
  article title.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
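
A minimal sketch of the write-and-symlink step described above (paths, the helper name, and the error handling are illustrative only, not the repository's actual code):

    use std::fs;
    use std::os::unix::fs::symlink;
    use std::path::Path;

    /// Write one language's html under wikidata/<QID>/ and, if the article
    /// matched a title, create the title directory as a symlink to it.
    fn write_article(
        base: &Path,
        qid: &str,
        lang: &str,
        html: &str,
        matched_title: Option<&Path>,
    ) -> std::io::Result<()> {
        let qid_dir = base.join("wikidata").join(qid);
        fs::create_dir_all(&qid_dir)?;
        fs::write(qid_dir.join(format!("{lang}.html")), html)?;

        if let Some(title_dir) = matched_title {
            // Titles containing `/` just become nested directories.
            if let Some(parent) = title_dir.parent() {
                fs::create_dir_all(parent)?;
            }
            if !title_dir.exists() {
                symlink(&qid_dir, title_dir)?;
            }
        }
        Ok(())
    }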

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:34:20 -04:00
Evan Lloyd New-Schmidt
bb1f897cd2 Add checks for whitespace/empty strings in ids and titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
0a0317538c Rewrite comments as sentences for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
8435682ddf Add support for multiple languages
Per-language section removal is configured with a static json file.

This includes a test to make sure the file exists and is formatted correctly.
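
For illustration only, loading and testing such a config might look roughly like this (the file name and schema are assumed, not taken from the repository):

    use std::collections::{BTreeMap, BTreeSet};

    /// Language code -> section headings to strip from articles in that language.
    type SectionsToRemove = BTreeMap<String, BTreeSet<String>>;

    fn parse_config(json: &str) -> Result<SectionsToRemove, serde_json::Error> {
        serde_json::from_str(json)
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn config_is_present_and_valid() {
            // include_str! fails the build if the file is missing; the test
            // fails if it is not valid JSON of the expected shape.
            // The file name here is a placeholder, not the real one.
            let config = parse_config(include_str!("../sections_to_remove.json")).unwrap();
            assert!(!config.is_empty());
        }
    }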

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
35faadc693 Optimize wikipedia title parsing
By far the most time was wasted because I used `ok_or` instead of
`ok_or_else` for converting to the anyhow errors. I thought that with a
static string and no arguments they wouldn't be that expensive, but they
still took a lot of time, even with backtraces disabled.

Removing them reduces the benchmark time 60x:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          36 ns/iter (+/- 11)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:         835 ns/iter (+/- 68)
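
The difference, roughly (a sketch with an invented accessor, not the project's code): `ok_or` builds its error argument on every call, while `ok_or_else` only runs the closure on the error path.

    use anyhow::{anyhow, Result};

    fn host(url: &url::Url) -> Result<&str> {
        // Eager: the anyhow::Error is constructed on every call, even when
        // the Option is Some.
        // url.host_str().ok_or(anyhow!("url has no host"))

        // Lazy: the closure (and the error) only runs when the value is None.
        url.host_str().ok_or_else(|| anyhow!("url has no host"))
    }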

I also tried removing url::parse and using string operations. That got
it down another 10x, but I don't think that's worth potential bugs
right now.

A small optimization for reading the json: `std::io::stdin()` is behind a
lock by default, and already has an internal buffer.
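
A sketch of what reading through the already-buffered lock can look like (simplified; this is not the actual input handling):

    use std::io::BufRead;

    fn read_input() -> std::io::Result<()> {
        // Locking once avoids re-acquiring the mutex on every read, and the
        // returned StdinLock already implements BufRead, so no extra
        // BufReader is needed.
        let stdin = std::io::stdin().lock();
        for line in stdin.lines() {
            let _line = line?;
            // ... parse the json line here ...
        }
        Ok(())
    }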

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
f12e8d802c Add id parsing benchmarks
Initial results:

    running 4 tests
    test hash_wikidata   ... bench:          14 ns/iter (+/- 0)
    test hash_wikipedia  ... bench:          34 ns/iter (+/- 1)
    test parse_wikidata  ... bench:          18 ns/iter (+/- 0)
    test parse_wikipedia ... bench:      60,682 ns/iter (+/- 83,376)

Based on these results and a flamegraph of loading the file, url parsing
is the most expensive part.
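
These numbers appear to come from libtest's nightly `#[bench]` harness; the general shape is something like the following (the input string is a stand-in, and routing `parse_wikipedia` through `url::Url::parse` is an assumption based on the later commit's mention of `url::parse`):

    #![feature(test)] // nightly-only benchmark harness
    extern crate test;

    use test::{black_box, Bencher};

    #[bench]
    fn parse_wikipedia(b: &mut Bencher) {
        // Stand-in input; the real benchmark data is not shown in this log.
        let input = "https://en.wikipedia.org/wiki/Article_Title";
        b.iter(|| black_box(url::Url::parse(black_box(input)).unwrap()));
    }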

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
d55d3cc7e0 Initial parsing and processing
The html processing should perform both of the main steps handled by the
original `descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text

Determining how similar the output is will require more testing.

A separate binary target is included for standalone html processing.
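
A rough sketch of those two steps using the `scraper` crate (the selectors and the section matching here are simplified placeholders, not the actual implementation):

    use scraper::{Html, Selector};

    /// Detach every element matched by `selector` from the parsed document.
    fn remove_matching(document: &mut Html, selector: &Selector) {
        let ids: Vec<_> = document.select(selector).map(|el| el.id()).collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }

    /// Detach elements that contain no non-whitespace text.
    fn remove_empty(document: &mut Html) {
        // Placeholder selector; the real set of candidate elements differs.
        let selector = Selector::parse("p, div, span").unwrap();
        let ids: Vec<_> = document
            .select(&selector)
            .filter(|el| el.text().all(|t| t.trim().is_empty()))
            .map(|el| el.id())
            .collect();
        for id in ids {
            if let Some(mut node) = document.tree.get_mut(id) {
                node.detach();
            }
        }
    }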

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
aba31775fa Setup GitHub (#2)
* Fix license identifier
* Add CI tests

A cached setup that includes cargo check, clippy, fmt, and test

* Fix formatting
* Remove explicit rustup install

cargo, etc. are already installed; see:
<https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md>

* Add more context to cache prefix-key
* Apply suggestions from code review
* Ignore non-rust files
* Remove unused matrix testing key
* Use a better filename

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-06-01 09:25:35 +02:00
Evan Lloyd New-Schmidt
ddf6028465 Initial rust setup (#1)
* Initial rust setup

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

* Update README.md

Co-authored-by: Evan Lloyd New-Schmidt <newsch@users.noreply.github.com>

---------

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Alexander Borsuk <170263+biodranik@users.noreply.github.com>
2023-05-30 19:00:05 +02:00
Alexander Borsuk
f72e380d11 Initial commit
2023-05-30 16:01:35 +02:00