Commit graph

57 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
b5f0b22f7a Only modify attributes on attached elements
I've seen a 10-20% speedup on larger articles.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
cd5132f28c Fix panics on empty trees
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
9858994f93 Create output directory after processing html
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
1da8ec212a Add checks for article redirects, empty articles, and sniff language
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Alexander Borsuk
e31027893e Exclude notes and references for en
Signed-off-by: Alexander Borsuk <me@alex.bio>
2024-01-20 02:06:26 +02:00
Jean-BaptisteC
6d8671dbc7 Bump checkout to v4
Signed-off-by: Jean-BaptisteC <jeanbaptiste.charron@outlook.fr>
2023-11-09 08:07:23 +01:00
Evan Lloyd New-Schmidt
af16cb6513 Exit early with error if any jobs fail
I've tried a number of ways to do this, and this has been the simplest
and most reliable.

- Catches jobs that exit before calling the function.
- Doesn't mess with the kill_jobs hook and leave orphan processes.
- Bubbles up the exit code.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-10-04 12:20:40 -04:00
Evan Lloyd New-Schmidt
99c3b72e51 Add option to use existing tag file
This makes testing the script behavior much faster.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-10-04 12:20:40 -04:00
Evan Lloyd New-Schmidt
54230ee4ff Add Portuguese language (#34)
Relates to organicmaps/organicmaps#6153

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Co-authored-by: Matheus Gomes <86851490+matheusgomesms@users.noreply.github.com>
2023-10-02 11:56:20 -04:00
Evan Lloyd New-Schmidt
9d1ad01f33 Improve script warnings/errors (#32)
- Warn on unexpected file extensions
- Move filename to end of errors

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:12:36 -04:00
Evan Lloyd New-Schmidt
29d90376f3 Move file parsing out of wm module
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
faf4b760b2 Add OSM object metadata
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.

OSM ids are not shared across nodes, ways, and relations, so the object
type should be saved as well. Including the edit version will make it
easier to see if a mis-tagged object is outdated.
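
The point about ids can be illustrated with a small sketch: if objects are keyed by `(type, id)` rather than by id alone, a node and a way that share a numeric id stay distinct. The `OsmType` enum and `index_versions` helper are hypothetical names for illustration, not taken from the repository.

```rust
use std::collections::HashMap;

/// Hypothetical enum for the three OSM object types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OsmType {
    Node,
    Way,
    Relation,
}

/// Hypothetical helper: index rows of (type, id, version) by (type, id),
/// keeping the highest version seen for each object.
fn index_versions(rows: &[(OsmType, u64, u32)]) -> HashMap<(OsmType, u64), u32> {
    let mut map = HashMap::new();
    for &(otype, id, version) in rows {
        let entry = map.entry((otype, id)).or_insert(version);
        if version > *entry {
            *entry = version;
        }
    }
    map
}
```

Keyed by id alone, the two objects with id 42 in the test below would collapse into one entry.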

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
a584498c65 Add additional checks for langs/titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
292eeac081 Add command to write tag errors to file
- Write a TSV file with the line number, error, and input text.
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
218e55931f Remove debug info from release builds
A special 'bench' profile can be used instead.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
c3162c9fed Reduce unnecessary rebuilds
- Only embed commit on release builds.
- Add CI and scripts to excluded cargo files.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-29 16:11:29 -04:00
Evan Lloyd New-Schmidt
33174511dd Handle simplification panics
I've tested manually and it:
- handles panics with a static message or formatted arguments
- logs an error instead of exiting (backtraces are still printed)
- writes any panic-causing html to an `errors/` subdirectory
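
A minimal sketch of the first point, assuming the panic is caught with `std::panic::catch_unwind`: a panic with a static message carries a `&'static str` payload, while one with formatted arguments carries a `String`, so both need to be checked. `catch_panic_message` is a hypothetical name, not the function used in this commit.

```rust
use std::panic;

/// Hypothetical wrapper: run `f` and turn any panic into an error string.
fn catch_panic_message<T>(f: impl FnOnce() -> T + panic::UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(f).map_err(|payload| {
        // `panic!("literal")` produces a `&'static str` payload,
        // `panic!("{}", x)` produces a `String` payload.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

The default panic hook still runs before the panic is caught, which is consistent with the note that backtraces are still printed.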

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 15:08:33 -04:00
Evan Lloyd New-Schmidt
3de06a3209 Disable printing backtraces by default.
The caught html panics still print backtraces. Disabling them in Rust
would require changing the global panic handler when entering and
exiting the function.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 15:08:33 -04:00
Evan Lloyd New-Schmidt
481ace45ce Add Download script (#22)
- Downloads latest enterprise dumps in requested languages
- Uses parallel downloading with wget2 if available
- Dumps are stored in subdirectories by date

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 11:45:07 -04:00
Evan Lloyd New-Schmidt
8191c36a5e Format let-else
Prior to rust 1.72.0, rustfmt ignored let-else statements:
https://blog.rust-lang.org/2023/07/01/rustfmt-supports-let-else-statements.html

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-26 11:13:29 -04:00
Evan Lloyd New-Schmidt
d0480e9089 Make thread pool control similar to osmium
- Allow setting number relative to number of cores
- Default to Cores - 2 threads
- Add env variable OM_POOL_THREADS (lower priority than CLI)
- Rename CLI option to `-t/--threads`
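
The precedence described above could be sketched roughly as follows. `resolve_threads` is a hypothetical helper, and treating zero or negative values as "cores minus that many" is an assumption about what "relative to number of cores" means; the actual parsing in the tool may differ.

```rust
/// Hypothetical helper: pick the pool size from an optional CLI value,
/// an optional OM_POOL_THREADS value, and the number of cores.
/// Assumption: values <= 0 mean "that many fewer than the core count".
fn resolve_threads(cli: Option<isize>, env: Option<isize>, cores: usize) -> usize {
    // The CLI option wins over the environment variable; default is cores - 2.
    let requested = cli.or(env).unwrap_or(-2);
    let n = if requested > 0 {
        requested as usize
    } else {
        // Saturate so machines with few cores do not underflow.
        cores.saturating_sub(requested.unsigned_abs())
    };
    // Always keep at least one worker thread.
    n.max(1)
}
```

In a real program the `env` argument would come from something like `std::env::var("OM_POOL_THREADS")` and `cores` from `std::thread::available_parallelism()`.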

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-09-01 12:03:58 -04:00
Evan Lloyd New-Schmidt
c7fe34f3ad Remove header ids
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
b96c2cf4db Refactor simplification
- Combine expansion steps
- Pull original steps into functions
- Use parent sections for removing specific headers
- Remove head in initial stage

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
6c02f4a569 Remove coordinates from output
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c4028e52fa Preserve excerpts
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
3d3ecb52b2 Minify whitespace between elements
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81783695d5 Remove doctype and html element
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
58f32b43fd Remove empty sections after other removals
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
cc3ae9b629 Remove "(listen)" text
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81f528a350 Expand spans, sections, and body after removing head
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
0a0a94b484 Remove comments
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
4b776f49d4 Add denylist from Extracts API
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
75fa04407d Add snapshot tests for html output
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
32cd084f3f Add simplification logging
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
c9eb7a160a Add option to not simplify when extracting
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
941d2b1032 Structure parse errors and only log warning if above threshold
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors now cause a failure instead of being logged.
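
As an illustration of the threshold rule, here is a sketch of the check using integer arithmetic so the 0.02% boundary is exact. `exceeds_error_threshold` is a hypothetical name, and returning `false` when there are no tags at all is an assumption.

```rust
/// Hypothetical check: true if at least 0.02% of tags failed to parse.
/// 0.02% is 2 in 10,000, so compare errors * 10_000 against total * 2.
fn exceeds_error_threshold(parse_errors: usize, total_tags: usize) -> bool {
    if total_tags == 0 {
        // Assumption: an empty input never triggers the error log.
        return false;
    }
    // Widen to u128 so the multiplication cannot overflow.
    (parse_errors as u128) * 10_000 >= (total_tags as u128) * 2
}
```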

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
34bb9318d5 Refactor and rename title/qid wrappers
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
bdf6f1a68c Improve url handling
- Check for urls in osm tags
- Handle mobile urls

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
6d242a62aa Extract tags in parallel in rust
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
b6db70f74c Refactor into subcommands
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
5df2d8d243 Add new option to parse osm tag file
Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the "--csv" output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-10 09:37:58 -04:00
Evan Lloyd New-Schmidt
0fc43767aa Add script
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
d6e892343b Keep charset tags
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
ac556bd3d4 Save and log build commit
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
aa213fbece Make new qid writes atomic
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-07 17:05:03 -04:00
Evan Lloyd New-Schmidt
75f4f6a21b Add option to dump new QIDs (#20)
This allows us to extract articles whose title we know but whose QID we don't from other languages' dumps in another pass.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-13 14:04:52 -04:00
Evan Lloyd New-Schmidt
45efd77c0d Remove images and links
See #11 for next steps

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:58:56 -04:00
Evan Lloyd New-Schmidt
9036e3413f Write to generator-compatible folder structure (#6)
The map generator expects a certain folder structure created by the
current scraper to add the article content into the mwm files.

- Article html is written to wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to article title directory.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
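
The paths in this layout can be derived with a few joins. This is a sketch only: the helper names are hypothetical, and it assumes titles already use underscores as shown in the tree.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: directory for an article with a known QID.
fn wikidata_dir(base: &Path, qid: &str) -> PathBuf {
    base.join("wikidata").join(qid)
}

/// Hypothetical helper: directory for an article matched by title.
/// The title is not escaped, so a `/` in it yields nested directories.
fn title_dir(base: &Path, wiki_lang: &str, title: &str) -> PathBuf {
    base.join(format!("{wiki_lang}.wikipedia.org"))
        .join("wiki")
        .join(title)
}

/// Hypothetical helper: file for one language version inside a directory.
fn article_file(dir: &Path, article_lang: &str) -> PathBuf {
    dir.join(format!("{article_lang}.html"))
}
```

For example, `title_dir(base, "en", "Baltimore/Washington_International_Airport")` ends in two nested directories, matching the `Baltimore` entry in the tree above.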

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-07-10 10:34:20 -04:00
Evan Lloyd New-Schmidt
bb1f897cd2 Add checks for whitespace/empty strings in ids and titles
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00
Evan Lloyd New-Schmidt
0a0317538c Rewrite comments as sentences for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-06-23 15:50:04 -04:00