Reorganize README for readability

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Evan Lloyd New-Schmidt, 2024-03-15 12:40:44 -04:00 (committed by Evan Lloyd New-Schmidt)
parent 775e23cf1e
commit 1245d6365a

README.md

@@ -3,16 +3,89 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
Extracted articles are identified by Wikipedia article titles in URL or text form (language-specific), and [Wikidata QIDs](https://www.wikidata.org/wiki/Wikidata:Glossary#QID) (language-agnostic).
OpenStreetMap (OSM) commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.
## Configuring
[`article_processing_config.json`](article_processing_config.json) is _compiled with the program_ and should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.
There are some tests for basic validation of the file; run them with `cargo test`.
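For example, after editing the file:
```sh
# The crate's unit tests include basic validation of article_processing_config.json.
cargo test
```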
## Usage
> [!NOTE]
> In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See [Usage with Maps Generator](#usage-with-maps-generator) for more info.
To run the wikiparser for development and testing, see below.
First, install [the Rust language tools](https://www.rust-lang.org/).
> [!IMPORTANT]
> For best performance, use `-r`/`--release` with `cargo build`/`run`.
You can run the program from within this directory using `cargo run --release --`.
Alternatively, build it with `cargo build --release`, which places the binary in `./target/release/om-wikiparser`.
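For example, here is a sketch of the two equivalent ways to invoke the tool described above:
```sh
# Build an optimized binary; the output is placed in ./target/release/.
cargo build --release
# These two invocations are then equivalent:
cargo run --release -- --version
./target/release/om-wikiparser --version
```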
Run the program with the `--help` flag to see all supported arguments.
```
$ cargo run -- --help
A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.

Usage: om-wikiparser <COMMAND>

Commands:
  get-tags      Extract wikidata/wikipedia tags from an OpenStreetMap PBF dump
  check-tags    Attempt to parse extracted OSM tags and write errors to stdout in TSV format
  get-articles  Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps
  simplify      Apply html simplification to a single article
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help
          Print help (see a summary with '-h')
  -V, --version
          Print version
```
> [!NOTE]
> Each subcommand has additional help.
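For example:
```sh
# Print the detailed help for a subcommand, e.g. get-articles.
cargo run -r -- get-articles --help
```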
The main work is done in the `get-articles` subcommand.
It takes as inputs:
- A [Wikipedia Enterprise JSON dump](#downloading-wikipedia-dumps), decompressed and connected to `stdin`.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters for the articles:
- Use `--osm-tags` if you have an [OSM .pbf file](#downloading-openstreetmap-osm-files) and can use the `get-tags` subcommand or the `osmconvert` tool.
  - Use `--wikidata-qids` or `--wikipedia-urls` if you have a list of URLs or QIDs from another source (see the second example below).
To test a single language in a specific map region, first get the matching tags for the region with `get-tags`:
```sh
cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
```
Then write the articles to a directory with `get-articles`:
```sh
tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
```
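If you instead have Wikidata QIDs or Wikipedia URLs from another source, the equivalent run might look like this sketch (`qids.txt` and `urls.txt` are placeholder file names; either flag works on its own):
```sh
# qids.txt: one Wikidata QID per line, e.g. Q12345
# urls.txt: one article URL per line, e.g. https://en.wikipedia.org/wiki/Article_Title
tar xzOf $dump | cargo run -r -- get-articles \
    --wikidata-qids qids.txt \
    --wikipedia-urls urls.txt \
    $OUTPUT_DIR
```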
## Downloading OpenStreetMap (OSM) files
To extract Wikipedia tags with the `get-tags` subcommand, you need a file in the [OSM `.pbf` format](https://wiki.openstreetmap.org/wiki/PBF_Format).
The "planet" file is [available directly from OSM](https://wiki.openstreetmap.org/wiki/Planet.osm) but is ~80GB in size; for testing you can [try a smaller region's data (called "Extracts") from one of the many providers](https://wiki.openstreetmap.org/wiki/Planet.osm#Extracts).
## Downloading Wikipedia Dumps
[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).
> [!WARNING]
> Each language's dump is tens of gigabytes in size, and much larger when decompressed.
> To avoid storing the decompressed data, pipe it directly into the wikiparser as described in [Usage](#usage).
To test a small number of articles, you can also use the [On-Demand API](https://enterprise.wikimedia.com/docs/on-demand/) to download them, which has a free tier.
Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
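For a single language, a manual download might look like the sketch below; the run date and language code are placeholders, and the `runs/<date>/` path layout is an assumption to check against [the runs listing](https://dumps.wikimedia.org/other/enterprise_html/runs/):
```sh
# Placeholders: pick a real run date and language code from the runs listing.
DUMP_DATE=20240301
DUMP_LANG=en
# --continue lets an interrupted download of the multi-gigabyte archive resume.
wget --continue \
  "https://dumps.wikimedia.org/other/enterprise_html/runs/${DUMP_DATE}/${DUMP_LANG}wiki-NS0-${DUMP_DATE}-ENTERPRISE-HTML.json.tar.gz"
```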
@@ -46,93 +119,12 @@ It maintains a directory with the following layout:
...
```
## Usage with Maps Generator
To use with the [maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md), see the [`run.sh` script](run.sh) and its own help documentation.
It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.
As an example of manual usage with the maps generator:
- Assuming this program is installed to `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
  Set `DUMP_DOWNLOAD_DIR` to the directory where they are downloaded.
@@ -150,8 +142,8 @@ export RUST_LOG=om_wikiparser=debug
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids wikidata_qids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    --write-new-qids new_qids.txt \
    descriptions/
@@ -159,8 +151,8 @@ done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done
```
@@ -172,7 +164,7 @@ om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --osm-tags osm_tags.tsv \
    --write-new-qids new_qids.txt \
    descriptions/
@@ -180,8 +172,8 @@ done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done
```