Commit graph

76 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
bab29c0de9 Preserve whitespace of removed "empty" elements
Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap with a span. Elements with no text, or
with whitespace-only text, were previously removed to prune `<link>`s
and parents of other removed elements.

This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and nodes that now contain
only whitespace after previous steps. The preserved whitespace in the
latter case is unlikely to remain because of later processing steps.

Fixes #47, fixes organicmaps/organicmaps#8651

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-07-08 17:13:55 -04:00
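The behavior described in bab29c0de9 can be illustrated with a minimal sketch. The `Node` type and the `text_of` and `prune_empty` functions below are hypothetical stand-ins written for this note, not the project's actual tree types or API. An element whose text is empty or whitespace-only is dropped, but any whitespace it held is kept as a plain text node in its place.

```rust
/// A toy document tree (hypothetical; the real parser's types are not shown in this log).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Concatenate all text beneath a node.
fn text_of(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { children, .. } => children.iter().map(text_of).collect(),
    }
}

/// Remove elements whose text is empty or whitespace-only, keeping any
/// whitespace they contained as a text node where the element used to be.
fn prune_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Text(t) => out.push(Node::Text(t)),
            Node::Element { tag, children } => {
                let children = prune_empty(children);
                let text: String = children.iter().map(text_of).collect();
                // `str::trim` treats U+00A0 (non-breaking space) as whitespace.
                if text.trim().is_empty() {
                    // "Empty" element: drop the tag, keep its whitespace.
                    if !text.is_empty() {
                        out.push(Node::Text(text));
                    }
                } else {
                    out.push(Node::Element { tag, children });
                }
            }
        }
    }
    out
}
```

With this sketch, a span holding only a non-breaking space between "5" and "km" collapses to that space, while a childless `link` element disappears entirely. As the commit message notes, the approach cannot tell meaningful whitespace from padding.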
Evan Lloyd New-Schmidt
3579410659 Remove pretty-printing
Whitespace behavior is different between Html::html and this
half-working pretty printer. Now the tests match the parser output
exactly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-07-08 17:13:55 -04:00
Evan Lloyd New-Schmidt
f2692d2ede Add Arabic language
The initial section list is based on the top 200 headers in matched
articles passed through Google Translate.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-06-12 21:57:13 +02:00
Evan Lloyd New-Schmidt
1245d6365a Reorganize README for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-30 12:17:59 -04:00
Evan Lloyd New-Schmidt
775e23cf1e Add links to additional title/lang requirements
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-30 12:17:59 -04:00
Evan Lloyd New-Schmidt
3d908a2866 Add warnings to commands that expect stdin
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-30 12:17:59 -04:00
Evan Lloyd New-Schmidt
e61f12d014 Make parse functions collection-agnostic
This makes it possible to use a Set in one place and a Vec in another,
to log or count items without allocating a collection for all of them,
and to ignore errors with no overhead.

The alternative is converting them to custom iterators, which is more work
than I want to do right now.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-28 14:20:52 -04:00
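One way to get the flexibility described in e61f12d014 is to accept any `Extend` implementor as the output sink. The sketch below is an assumption about the general technique, not the project's code: `parse_titles`, `Count`, and `Ignore` are hypothetical names invented for illustration.

```rust
/// Hypothetical parser: one title per line; blank lines are reported as errors.
/// Both sinks are generic, so callers choose how results are collected.
fn parse_titles(
    input: &str,
    titles: &mut impl Extend<String>,
    errors: &mut impl Extend<(usize, String)>,
) {
    for (i, line) in input.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            // `Option` is an iterator of zero or one items.
            errors.extend(Some((i + 1, "empty title".to_string())));
        } else {
            titles.extend(Some(line.to_string()));
        }
    }
}

/// Counts items without storing them.
#[derive(Default)]
struct Count(usize);

impl<T> Extend<T> for Count {
    fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
        self.0 += iter.into_iter().count();
    }
}

/// Discards every item it is given.
struct Ignore;

impl<T> Extend<T> for Ignore {
    fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
        iter.into_iter().for_each(drop);
    }
}
```

A caller can pass a `HashSet<String>` to deduplicate titles in one place, a `Vec` to keep order in another, `Count` to tally errors without allocating, or `Ignore` to drop them.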
Evan Lloyd New-Schmidt
d723452ec5 Use tracing-logfmt
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-28 14:20:52 -04:00
Evan Lloyd New-Schmidt
1f7d0695e2 Add option to dump input json to stdout
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-28 14:20:52 -04:00
Evan Lloyd New-Schmidt
cb835fcbc6 Remove header sections by sibling instead of parent section
Not all languages have a nice `section > h2` structure.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-27 21:12:23 -04:00
Evan Lloyd New-Schmidt
578f8a319d Check for elements in sections before header
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-27 21:12:23 -04:00
Evan Lloyd New-Schmidt
cd03fed762
Add check for unicode normalization in config (#44)
This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case where non-normalized output is inadvertently created
as tags are joined, but I don't think that's worth it yet.

From <https://mediawiki.org/wiki/Unicode_normalization_considerations>:

> MediaWiki applies normalization form C (NFC) to Unicode text input.

> MediaWiki doesn't apply any normalization to its output, for example
> `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row,
> without precomposed characters like U+00E9 appearing).

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-04-24 10:34:11 -04:00
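A small illustration of why the check in cd03fed762 matters, using only the standard library. The constants and the `is_excluded` helper are hypothetical; the sketch assumes config sections are compared to article headers as plain strings. The actual normalization check would need a Unicode normalization library, which is not shown here.

```rust
/// Two spellings of "café" that render identically on screen.
const PRECOMPOSED: &str = "caf\u{00E9}"; // NFC: U+00E9
const DECOMPOSED: &str = "cafe\u{0301}"; // U+0065 followed by combining U+0301

/// Plain string lookup against a list of excluded section names
/// (an assumption about how a config list might be matched).
fn is_excluded(header: &str, excluded: &[&str]) -> bool {
    excluded.contains(&header)
}
```

Because the two constants differ at the byte level, a config entry written in decomposed form would silently fail to match an NFC header from Wikipedia, which is the mismatch the config check is meant to catch.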
Lens0021 / Leslie
19d9f2c42a
Exclude more Korean sections (#43)
* Exclude more Korean sections

- "각주 및 참고 문헌": notes and references
- "같이 읽기": read also
- "관련 항목": related items
- "관련 홈페이지": related web sites
- "더 보기": see more
- "더 읽어보기": read more
- "둘러보기": look around
- "외부 링크 및 참고 자료": external links and references
- "외부 영상": external videos
- "외부링크": external links
- "주해": notes
- "참고 문헌 및 링크": references and links
- "참고 문헌": references
- "참고 서적": references
- "참고": notes
- "참고자료": referenced data
- "참조 문헌": references
- "참조 자료": referenced data
- "참조 항목": referenced items
- "참조": references

* Undo exclude "둘러보기", which is used as 'Nearby places'
2024-04-21 17:22:27 +02:00
Lens0021 / Leslie
b2eabf8538 Add Korean language
Signed-off-by: Lens0021 / Leslie <lorentz0021@gmail.com>
2024-04-21 14:56:24 +02:00
Jonathan Davies
a4caf23e35 Cargo.toml: Enabled LTO for release build.
2024-03-15 01:59:48 +01:00
Evan Lloyd New-Schmidt
2dc0b1bc34 Log redirect name instead of article on error
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
580b60bdd4 Use tracing with pid, lang, title, url, line, and byte fields
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
bfdb3c17a9 Detect language with simplify subcommand
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
7d453d5e63 Reorganize html module
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
b5f0b22f7a Only modify attributes on attached elements
I've seen a 10-20% speedup on larger articles.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
cd5132f28c Fix panics on empty trees
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
9858994f93 Create output directory after processing html
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
1da8ec212a Add checks for article redirects and empty articles, and sniff language
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Alexander Borsuk
e31027893e Exclude notes and references for en
Signed-off-by: Alexander Borsuk <me@alex.bio>
2024-01-20 02:06:26 +02:00
6d8671dbc7 Bump checkout to v4
Signed-off-by: Jean-BaptisteC <jeanbaptiste.charron@outlook.fr>
2023-11-09 08:07:23 +01:00
Evan Lloyd New-Schmidt
af16cb6513 Exit early with error if any jobs fail
I've tried a number of ways to do this, and this has been the simplest
and most reliable.

- Catches jobs that exit before calling the function.
- Doesn't mess with the kill_jobs hook and leave orphan processes.
- Bubbles up the exit code.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-10-04 12:20:40 -04:00
Evan Lloyd New-Schmidt
99c3b72e51 Add option to use existing tag file
This makes testing the script behavior much faster.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-10-04 12:20:40 -04:00
Evan Lloyd New-Schmidt
54230ee4ff
Add Portuguese language (#34)
Relates to organicmaps/organicmaps#6153

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Matheus Gomes <86851490+matheusgomesms@users.noreply.github.com>
2023-10-02 11:56:20 -04:00
Evan Lloyd New-Schmidt
9d1ad01f33
Improve script warnings/errors (#32)
- Warn on unexpected file extensions
- Move filename to end of errors

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:12:36 -04:00
Evan Lloyd New-Schmidt
29d90376f3 Move file parsing out of wm module
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
faf4b760b2 Add OSM object metadata
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.

OSM ids are not shared across nodes, ways, and relations, so the object
type should be saved as well. Including the edit version will make it
easier to see if a mis-tagged object is outdated.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
a584498c65 Add additional checks for langs/titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
292eeac081 Add command to write tag errors to file
- Write a TSV file with the line number, error, and input text.
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
218e55931f Remove debug info from release builds
A special 'bench' profile can be used instead.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
c3162c9fed Reduce unnecessary rebuilds
- Only embed commit on release builds.
- Add CI and scripts to excluded cargo files.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
33174511dd Handle simplification panics
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing html to an `errors/` subdirectory

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 15:08:33 -04:00
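The panic handling described in 33174511dd can be sketched with `std::panic::catch_unwind`. This is a minimal sketch of the general technique, not the project's implementation: `catch_simplify` is a hypothetical wrapper, and writing the offending HTML to an `errors/` subdirectory is left out.

```rust
use std::panic::{self, UnwindSafe};

/// Run `simplify` and turn a panic into an `Err` carrying the panic message.
/// The default panic hook still prints to stderr before this returns.
fn catch_simplify<F>(simplify: F) -> Result<String, String>
where
    F: FnOnce() -> String + UnwindSafe,
{
    panic::catch_unwind(simplify).map_err(|payload| {
        // A panic with a static message usually carries a `&'static str`;
        // one with formatted arguments usually carries a `String`.
        if let Some(msg) = payload.downcast_ref::<&'static str>() {
            (*msg).to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

The caller can then log the `Err` and move on to the next article instead of exiting. This only works when the binary is built with unwinding panics, which is Rust's default.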
Evan Lloyd New-Schmidt
3de06a3209 Disable printing backtraces by default.
The caught HTML panics still print backtraces. Disabling them in Rust
would require changing the global panic handler when entering and
exiting the function.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 15:08:33 -04:00
Evan Lloyd New-Schmidt
481ace45ce
Add Download script (#22)
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 11:45:07 -04:00
Evan Lloyd New-Schmidt
8191c36a5e Format let-else
Prior to Rust 1.72.0, rustfmt ignored let-else statements:
https://blog.rust-lang.org/2023/07/01/rustfmt-supports-let-else-statements.html

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 11:13:29 -04:00
Evan Lloyd New-Schmidt
d0480e9089 Make thread pool control similar to osmium
- Allow setting number relative to number of cores
- Default to Cores - 2 threads
- Add env variable OM_POOL_THREADS (lower priority than CLI)
- Rename CLI option to `-t/--threads`

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-01 12:03:58 -04:00
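The precedence rules listed in d0480e9089 could look roughly like the sketch below. The function names are hypothetical, and the exact meaning of zero and negative values (zero for all cores, negative for cores minus N) is an assumption modeled on the convention osmium is understood to use. Only the `OM_POOL_THREADS` name, the CLI-over-environment priority, and the "cores - 2" default come from the commit message.

```rust
/// Resolve the pool size from an optional CLI value, an optional
/// environment value, and the number of available cores.
fn resolve_threads(cli: Option<isize>, env: Option<isize>, cores: usize) -> usize {
    // The CLI option wins over the environment; the default is "cores - 2".
    let requested = cli.or(env).unwrap_or(-2);
    let n = if requested > 0 {
        requested // absolute thread count
    } else {
        cores as isize + requested // zero or negative: relative to cores (assumed)
    };
    // Always run at least one thread.
    n.max(1) as usize
}

/// Read OM_POOL_THREADS, ignoring unset or unparsable values.
fn env_threads() -> Option<isize> {
    std::env::var("OM_POOL_THREADS").ok()?.trim().parse().ok()
}
```

For example, on an 8-core machine with no options set this yields 6 threads, and `-t -1` on the command line would yield 7 even if the environment variable asks for 4.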
Evan Lloyd New-Schmidt
c7fe34f3ad Remove header ids
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
b96c2cf4db Refactor simplification
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove head in initial stage

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
6c02f4a569 Remove coordinates from output
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c4028e52fa Preserve excerpts
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
3d3ecb52b2 Minify whitespace between elements
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81783695d5 Remove doctype and html element
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
58f32b43fd Remove empty sections after other removals
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
cc3ae9b629 Remove "(listen)" text
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81f528a350 Expand spans, sections, and body after removing head
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
0a0a94b484 Remove comments
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00