- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.
OSM ids are not unique across nodes, ways, and relations (each object
type has its own id space), so the object type should be saved as well.
Including the edit version will make it easier to tell whether a
mis-tagged object is simply outdated.
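A minimal sketch of how the @otype column can be modeled when parsing (the names here are illustrative; osmconvert encodes @otype as a digit, 0 = node, 1 = way, 2 = relation):

```rust
/// OSM object type; ids are only unique within one of these.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObjType {
    Node,
    Way,
    Relation,
}

fn parse_otype(field: &str) -> Option<ObjType> {
    match field {
        "0" => Some(ObjType::Node),
        "1" => Some(ObjType::Way),
        "2" => Some(ObjType::Relation),
        _ => None,
    }
}
```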
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Write a TSV file with the line number, error, and input text (see the sketch below).
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.
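A minimal sketch of the row format, matching the columns listed above (the helper is hypothetical):

```rust
use std::io::Write;

/// One tab-separated row per failed input line.
fn write_error_row(
    out: &mut impl Write,
    line_number: usize,
    error: &str,
    input: &str,
) -> std::io::Result<()> {
    writeln!(out, "{line_number}\t{error}\t{input}")
}
```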
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing HTML to an `errors/` subdirectory (see the sketch below)
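A sketch of the approach with `std::panic::catch_unwind` (`process`, the error path, and the logging macro are assumptions; static `panic!` messages arrive as `&str` payloads, formatted ones as `String`):

```rust
use std::panic;
use std::path::Path;

fn process_caught(html: &str, error_path: &Path) -> anyhow::Result<Option<String>> {
    match panic::catch_unwind(|| process(html)) {
        Ok(output) => Ok(Some(output)),
        Err(payload) => {
            // `panic!("...")` yields &str, `panic!("{}", x)` yields String.
            let message = payload
                .downcast_ref::<&str>()
                .copied()
                .or_else(|| payload.downcast_ref::<String>().map(String::as_str))
                .unwrap_or("unknown panic");
            log::error!("panic while processing article: {message}");
            // Keep the offending input around for debugging.
            std::fs::write(error_path, html)?;
            Ok(None)
        }
    }
}
```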
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The caught HTML panics still print backtraces. Disabling that in Rust
would require swapping out the global panic hook when entering and
exiting the function.
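For reference, a sketch of what that swap would look like (the hook is process-global, so this is not thread-friendly):

```rust
use std::panic;

// Silence the default message + backtrace around the call, then restore it.
let previous_hook = panic::take_hook();
panic::set_hook(Box::new(|_info| {}));
let result = panic::catch_unwind(|| process(html));
panic::set_hook(previous_hook);
```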
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove `<head>` in the initial stage
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing (sketched below)
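A minimal sketch of the harness on the nightly `test` crate (the sample path and the `simplify` entry point are assumptions):

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn process_html(b: &mut Bencher) {
    // One article body from the dump, checked in next to the benches.
    let html = include_str!("../tests/data/article.html");
    b.iter(|| simplify(html));
}
```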
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add custom error types with the `thiserror` crate in preparation for #25 (see the sketch below).
- Parsing errors are captured instead of being logged at `warn` by default.
- All parsing errors are still logged at `debug` level.
- If >= 0.02% of tags can't be parsed, an error is logged.
- TSV line errors are always logged as errors.
- I/O errors now fail instead of being logged.
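A minimal sketch of the shape this takes with `thiserror` (variant names are illustrative):

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ParseError {
    #[error("unrecognized wikipedia tag: {0}")]
    Wikipedia(String),
    #[error("bad wikidata QID: {0}")]
    Qid(String),
    // I/O errors are propagated to the caller instead of captured.
    #[error(transparent)]
    Io(#[from] std::io::Error),
}
```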
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use the rayon and osmpbf crates; output an intermediate TSV file in the
same format as osmconvert, for use with the new `--osm-tags` flag (see the
sketch after this list).
- The number of threads spawned can be configured with the `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- `run.sh` now expects a PBF file to extract tags from.
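A minimal sketch of the parallel read with `osmpbf` (counting tagged ways as a stand-in for writing the TSV; wiring `--procs` into rayon's global pool is an assumption):

```rust
use osmpbf::{Element, ElementReader};

fn count_wikidata_ways(pbf_path: &str, procs: usize) -> anyhow::Result<u64> {
    // Bound the worker threads, as the `--procs` flag might.
    rayon::ThreadPoolBuilder::new()
        .num_threads(procs)
        .build_global()?;

    let reader = ElementReader::from_path(pbf_path)?;
    // Map each element on worker threads, then reduce the partial sums.
    let count = reader.par_map_reduce(
        |element| match element {
            Element::Way(way) if way.tags().any(|(k, _)| k == "wikidata") => 1u64,
            _ => 0,
        },
        || 0u64,
        |a, b| a + b,
    )?;
    Ok(count)
}
```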
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Use CLI subcommands (e.g. `om-wikiparser get-articles`; sketched below)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand
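A minimal sketch of the CLI shape with `clap`'s derive API (assuming clap v4; fields are elided):

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "om-wikiparser")]
struct Args {
    #[command(subcommand)]
    cmd: Cmd,
}

#[derive(Subcommand)]
enum Cmd {
    /// Invoked as `om-wikiparser get-articles`.
    GetArticles,
    /// The former standalone simplify binary.
    Simplify,
}

fn main() {
    match Args::parse().cmd {
        Cmd::GetArticles => { /* ... */ }
        Cmd::Simplify => { /* ... */ }
    }
}
```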
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Parse wikipedia and wikidata tags from a TSV file of OSM tags,
compatible with the `--csv` output of `osmconvert`.
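A minimal sketch of the per-line parsing, assuming tab-separated `@id`, `wikidata`, and `wikipedia` columns (the real column set is whatever `osmconvert` was asked for):

```rust
/// Returns None if the line is malformed or the id doesn't parse.
fn parse_tag_line(line: &str) -> Option<(u64, &str, &str)> {
    let mut fields = line.split('\t');
    let id = fields.next()?.parse().ok()?;
    let wikidata = fields.next()?;
    let wikipedia = fields.next()?;
    Some((id, wikidata, wikipedia))
}
```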
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
This allows us to extract articles that we know the title of, but not the QID of, from other languages' dumps in another pass.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Per-language section removal is configured with a static JSON file.
This includes a test to make sure the file exists and is formatted correctly.
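A sketch of what that test can look like with `serde_json` (the file name and the language-to-sections shape are assumptions):

```rust
use std::collections::{BTreeMap, BTreeSet};

#[test]
fn config_parses() {
    // Compiled in, so the test also fails if the file goes missing.
    let raw = include_str!("../article_processing_config.json");
    let config: BTreeMap<String, BTreeSet<String>> =
        serde_json::from_str(raw).expect("config is valid JSON in the expected shape");
    assert!(!config.is_empty());
}
```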
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
By far the largest amount of time was wasted because I used `ok_or`
instead of `ok_or_else` to convert to the anyhow errors. I thought that
with a static string and no arguments they weren't that expensive, but
they still took a lot of time, even with backtraces disabled.
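For reference, the difference is just eager versus lazy construction of the error (`tag` is an illustrative `Option`):

```rust
use anyhow::anyhow;

// Eager: the anyhow::Error is built on every call, even when `tag` is Some.
let value = tag.ok_or(anyhow!("no wikidata tag"))?;

// Lazy: the error is only constructed on the None path.
let value = tag.ok_or_else(|| anyhow!("no wikidata tag"))?;
```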
Removing those eager conversions reduces the benchmark time 60x:
running 4 tests
test hash_wikidata   ... bench:  14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:  36 ns/iter (+/- 11)
test parse_wikidata  ... bench:  18 ns/iter (+/- 0)
test parse_wikipedia ... bench: 835 ns/iter (+/- 68)
I also tried removing `Url::parse` and using string operations. That got
it down another 10x, but I don't think that's worth the potential bugs
right now.
A small optimization for reading the JSON: `std::io::stdin()` is behind
a lock by default and already has an internal buffer.
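A sketch of the pattern:

```rust
use std::io::{self, BufRead};

fn read_articles() -> io::Result<()> {
    // Stdin already buffers internally; locking once up front avoids
    // re-acquiring the mutex on every read.
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        // ... deserialize the JSON record from `line` ...
    }
    Ok(())
}
```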
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Initial results:
running 4 tests
test hash_wikidata   ... bench:     14 ns/iter (+/- 0)
test hash_wikipedia  ... bench:     34 ns/iter (+/- 1)
test parse_wikidata  ... bench:     18 ns/iter (+/- 0)
test parse_wikipedia ... bench: 60,682 ns/iter (+/- 83,376)
Based on these results and a flamegraph of loading the file, URL parsing
is the most expensive part.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The HTML processing should perform both of the main steps handled by the
original `descriptions_downloader.py` script:
- remove specific sections, e.g. "References"
- remove elements with no non-whitespace text
Determining how similar the output is will require more testing.
A separate binary target is included for standalone HTML processing.
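A sketch of the section-removal pass, assuming the `scraper` crate (the selector and matching are simplified; the empty-element pass would follow the same collect-then-detach pattern):

```rust
use scraper::{Html, Selector};

/// Detach any element whose header text matches a configured section title.
fn remove_sections(document: &mut Html, titles: &[&str]) {
    let headers = Selector::parse("h2").expect("static selector is valid");
    let ids: Vec<_> = document
        .select(&headers)
        .filter(|header| {
            let text: String = header.text().collect();
            titles.contains(&text.trim())
        })
        // Remove the enclosing parent section, not just the header itself.
        .filter_map(|header| header.parent().map(|section| section.id()))
        .collect();
    for id in ids {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}
```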
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>