Wikipedia parser that generates offline content embeddable into Organic Maps mwm map files
Write to generator-compatible folder structure (#6)
The map generator expects the folder structure created by the current
scraper so that it can add the article content into the mwm files.

- Article HTML is written to the wikidata directory.
- Directories are created for any matched titles and symlinked to the
  wikidata directory.
- Articles without a QID are written to a directory named after the
  article title.
- Article titles containing `/` are not escaped, so multiple
  subdirectories are possible.

The output folder hierarchy looks like this:

    .
    ├── de.wikipedia.org
    │  └── wiki
    │     ├── Coal_River_Springs_Territorial_Park
    │     │  ├── de.html
    │     │  └── ru.html
    │     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
    │     │  ├── de.html
    │     │  └── en.html
    │    ...
    ├── en.wikipedia.org
    │  └── wiki
    │     ├── Arctic_National_Wildlife_Refuge
    │     │  ├── de.html
    │     │  ├── en.html
    │     │  ├── es.html
    │     │  ├── fr.html
    │     │  └── ru.html
    │     ├── Baltimore
    │     │  └── Washington_International_Airport
    │     │     ├── de.html
    │     │     ├── en.html
    │     │     ├── es.html
    │     │     ├── fr.html
    │     │     └── ru.html
    │    ...
    └── wikidata
       ├── Q59320
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
       ├── Q120306
       │  ├── de.html
       │  ├── en.html
       │  ├── es.html
       │  ├── fr.html
       │  └── ru.html
      ...
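
For example, a matched title directory under a language domain is just a
symlink into the shared wikidata directory. A quick way to verify this
(a sketch only; the title-to-QID pairing shown in the comment is made up):

    # A matched title directory should resolve to its wikidata/<QID> counterpart.
    readlink de.wikipedia.org/wiki/Coal_River_Springs_Territorial_Park
    # prints something like ../../wikidata/Q59320 (illustrative target)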


wikiparser

Extracts articles from Wikipedia database dumps for embedding into the mwm map files created by the Organic Maps generator.

Configuring

article_processing_config.json should be updated when adding a new language. It defines article sections that are not important for users and should be removed from the extracted HTML.
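
For example, a quick way to review the currently configured sections before
adding a new language (assuming jq is installed; any JSON tool works):

    # Pretty-print the per-language lists of sections to strip.
    jq . article_processing_config.json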

Usage

First, install the Rust language tools (rustc and cargo), e.g. via rustup.

For best performance, use --release when building or running.

You can run the program from within this directory using cargo run --release --.

Alternatively, build it with cargo build --release, which places the binary in ./target/release/om-wikiparser.

Run the program with the --help flag to see all supported arguments.
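
For example, a typical build-and-inspect sequence using the commands above:

    # Build an optimized binary, then list the supported flags.
    cargo build --release
    ./target/release/om-wikiparser --help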

It takes as inputs:

  • A Wikidata Enterprise JSON dump, extracted and piped to stdin.
  • A file of Wikidata QIDs to extract, one per line (e.g. Q12345), passed as the CLI flag --wikidata-ids.
  • A file of Wikipedia article URLs to extract, one per line (e.g. https://$LANG.wikipedia.org/wiki/$ARTICLE_TITLE), passed as the CLI flag --wikipedia-urls.
  • A directory to write the extracted articles to, as a CLI argument.
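
For a quick standalone test, the two selection files are plain text with one
entry per line and can be created by hand (the values below are made up):

    # Minimal example inputs; real runs use files derived from the generator.
    printf '%s\n' Q12345 Q67890 > wikidata_ids.txt
    printf '%s\n' 'https://en.wikipedia.org/wiki/Article_Title' \
      'https://de.wikipedia.org/wiki/Beispielartikel' > wikipedia_urls.txt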

As an example of usage with the map generator:

  • Assuming this program is installed to $PATH as om-wikiparser.
  • Download the dumps in the desired languages (use the files with the format ${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz). Set DUMP_DOWNLOAD_DIR to the directory they are downloaded to.
  • Run the following from within the intermediate_data subdirectory of the maps build directory:
    # Transform intermediate files from the generator.
    cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
    tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
    # Begin extraction. Each dump is streamed to stdout so it can be piped
    # into om-wikiparser without unpacking it to disk first.
    for dump in "$DUMP_DOWNLOAD_DIR"/*-ENTERPRISE-HTML.json.tar.gz
    do
      tar xzf "$dump" --to-stdout | om-wikiparser \
        --wikidata-ids wikidata_ids.txt \
        --wikipedia-urls wikipedia_urls.txt \
        descriptions/
    done
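
After a run, the output can be spot-checked against the folder layout described
earlier (a sketch, assuming descriptions/ was used as the output directory, as
in the loop above):

    # Each extracted article ends up under wikidata/<QID>/ with one HTML file per language.
    ls descriptions/wikidata | head
    find descriptions -name '*.html' | head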