Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap in a span. Elements with no text or only
whitespace were previously removed in order to prune `<link>`s and the
parents of other removed elements.
This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and nodes that now contain
only whitespace after previous steps. The preserved whitespace in the
latter case is unlikely to remain because of later processing steps.
Fixes #47, fixes organicmaps/organicmaps#8651
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Whitespace behavior is different between Html::html and this
half-working pretty printer. Now the tests match the parser output
exactly.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The initial section list is based on the top 200 headers in matched
articles passed through Google Translate.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
This makes it possible to use a Set in one place and a Vec in another,
to log or count items without allocating a collection for all of them,
and to ignore errors with no overhead.
The alternative is converting them to custom iterators, which is more work
than I want to do right now.
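
A minimal sketch of the idea, assuming the parsing functions hand each
item to a caller-supplied closure. `parse_ids`, `unique_ids`, and
`count_errors` are hypothetical names for illustration, not the project's
actual functions:

```rust
use std::collections::BTreeSet;
use std::num::ParseIntError;

/// Hypothetical parser: passes every parsed line to `push` instead of
/// returning a concrete collection type.
fn parse_ids(input: &str, mut push: impl FnMut(Result<u64, ParseIntError>)) {
    for line in input.lines() {
        push(line.trim().parse::<u64>());
    }
}

/// One caller collects the valid ids into a set and ignores errors.
fn unique_ids(input: &str) -> BTreeSet<u64> {
    let mut set = BTreeSet::new();
    parse_ids(input, |item| {
        if let Ok(id) = item {
            set.insert(id);
        }
    });
    set
}

/// Another caller only counts errors and never stores an item.
fn count_errors(input: &str) -> usize {
    let mut errors = 0;
    parse_ids(input, |item| {
        if item.is_err() {
            errors += 1;
        }
    });
    errors
}
```

The caller decides what happens to each item, so the parser itself never
allocates a collection.
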
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case where non-normalized output is inadvertently created
as tags are joined, but I don't think that's worth it yet.
From <https://mediawiki.org/wiki/Unicode_normalization_considerations>:
> MediaWiki applies normalization form C (NFC) to Unicode text input.
> MediaWiki doesn't apply any normalization to its output, for example
> `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row,
> without precomposed characters like U+00E9 appearing).
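
A small sketch of why the config needs the same normalization as the
article text: the two spellings of "café" from the quote above compare as
different strings. `is_configured` is a hypothetical helper; the actual
normalization step (for example with the `unicode-normalization` crate) is
not shown here:

```rust
use std::collections::HashSet;

/// "café" with the precomposed U+00E9, as NFC-normalized text stores it.
const COMPOSED: &str = "caf\u{00E9}";
/// "café" as U+0065 followed by the combining acute accent U+0301.
const DECOMPOSED: &str = "cafe\u{0301}";

/// Hypothetical lookup: an exact comparison of code points, with no
/// normalization applied to either side.
fn is_configured(sections: &HashSet<&str>, title: &str) -> bool {
    sections.contains(title)
}
```

A section list holding only `COMPOSED` does not match `DECOMPOSED`, even
though both render identically.
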
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
* Exclude more Korean sections
- "각주 및 참고 문헌": notes and references
- "같이 읽기": read also
- "관련 항목": related items
- "관련 홈페이지": related web sites
- "더 보기": see more
- "더 읽어보기": read more
- "둘러보기": look around
- "외부 링크 및 참고 자료": external links and references
- "외부 영상": external videos
- "외부링크": external links
- "주해": notes
- "참고 문헌 및 링크": references and links
- "참고 문헌": references
- "참고 서적": references
- "참고": notes
- "참고자료": referenced data
- "참조 문헌": references
- "참조 자료": referenced data
- "참조 항목": referenced items
- "참조": references
* Undo the exclusion of "둘러보기", which is used as 'Nearby places'
I've tried a number of ways to do this, and this has been the simplest
and most reliable.
- Catches jobs that exit before calling the function.
- Doesn't mess with the kill_jobs hook and leave orphan processes.
- Bubbles up the exit code.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.
OSM ids are not shared across nodes, ways, and relations, so the object
type should be saved as well. Including the edit version will make it
easier to see if a mis-tagged object is outdated.
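
A rough sketch of the reasoning, assuming an object is identified by its
type together with its numeric id. `OsmType`, `OsmObject`, and
`distinct_objects` are hypothetical names, not the project's types:

```rust
use std::collections::HashSet;

/// OSM object kinds; each kind has its own id space.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OsmType {
    Node,
    Way,
    Relation,
}

/// Hypothetical identifier: the numeric id alone is ambiguous, so the
/// object type is part of the key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct OsmObject {
    otype: OsmType,
    id: u64,
}

/// Count distinct objects; node 1 and way 1 are different objects.
fn distinct_objects(objects: &[OsmObject]) -> usize {
    let set: HashSet<OsmObject> = objects.iter().cloned().collect();
    set.len()
}
```
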
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Write a TSV file with the line number, error, and input text.
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.
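
The row format above could look roughly like this. The column order, the
`TagError` struct, and the escaping scheme are assumptions made for
illustration, not the format the tool actually writes:

```rust
/// Hypothetical record for one tag file line that failed to parse.
struct TagError<'a> {
    line: usize,
    osm_id: Option<u64>,
    error: &'a str,
    text: &'a str,
}

/// Escape characters that would break the one-record-per-line layout.
fn escape_field(s: &str) -> String {
    s.replace('\\', "\\\\")
        .replace('\t', "\\t")
        .replace('\n', "\\n")
}

/// Format one row: line number, OSM id (empty if unknown), error, input.
fn to_tsv_row(e: &TagError) -> String {
    let id = e.osm_id.map(|id| id.to_string()).unwrap_or_default();
    format!(
        "{}\t{}\t{}\t{}",
        e.line,
        id,
        escape_field(e.error),
        escape_field(e.text)
    )
}
```
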
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing html to an `errors/` subdirectory
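
For reference, a minimal sketch of catching a panic and recovering its
message with `std::panic::catch_unwind`. `process_guarded` and
`panic_message` are hypothetical names, and writing the offending html to
`errors/` is left to the caller:

```rust
use std::any::Any;
use std::panic;

/// Extract a readable message from a panic payload. A panic with a static
/// message carries a `&'static str`; one with formatted arguments carries
/// a `String`.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

/// Hypothetical wrapper: run `process` on `html` and turn a panic into an
/// `Err` with its message instead of unwinding further.
fn process_guarded<F>(html: &str, process: F) -> Result<String, String>
where
    F: FnOnce(&str) -> String + panic::UnwindSafe,
{
    panic::catch_unwind(|| process(html)).map_err(|payload| panic_message(&*payload))
}
```

The default panic hook still runs before the unwind is caught, so the
message and backtrace are still printed to stderr.
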
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
The caught html panics still print backtraces. Disabling them in Rust
would require swapping the global panic hook when entering and
exiting the function.
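
A sketch of what that swap would involve, shown only to illustrate the
trade-off; `catch_silently` is a hypothetical helper and this is not what
the code does:

```rust
use std::panic;

/// Run `f` with panic output suppressed, then restore the previous hook.
/// The hook is process-wide, so panics on other threads are also silenced
/// while `f` runs.
fn catch_silently<F, R>(f: F) -> std::thread::Result<R>
where
    F: FnOnce() -> R + panic::UnwindSafe,
{
    let previous = panic::take_hook();
    // Install a hook that prints nothing.
    panic::set_hook(Box::new(|_| {}));
    let result = panic::catch_unwind(f);
    panic::set_hook(previous);
    result
}
```
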
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove head in initial stage
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>