Reorganize README for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
This commit is contained in:
parent 775e23cf1e
commit 1245d6365a
1 changed file with 87 additions and 95 deletions
README.md | 182
@@ -3,16 +3,89 @@

_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._

Extracted articles are identified by Wikipedia article titles in url or text form (language-specific), and [Wikidata QIDs](https://www.wikidata.org/wiki/Wikidata:Glossary#QID) (language-agnostic).
OpenStreetMap (OSM) commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.

## Configuring

[`article_processing_config.json`](article_processing_config.json) is _compiled with the program_ and should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.
There are some tests for basic validation of the file; run them with `cargo test`.

## Usage

> [!NOTE]
> In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See [Usage with Maps Generator](#usage-with-maps-generator) for more info.

To run the wikiparser for development and testing, see below.

First, install [the Rust language tools](https://www.rust-lang.org/).

> [!IMPORTANT]
> For best performance, use `-r`/`--release` with `cargo build`/`run`.

You can run the program from within this directory using `cargo run --release --`.
Alternatively, build it with `cargo build --release`, which places the binary in `./target/release/om-wikiparser`.
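
If you want the `om-wikiparser` binary on your `$PATH` (as the maps generator examples later in this document assume), one option is `cargo install`; this is a sketch, not the project's required install method:

```sh
# Build in release mode and install the binary to ~/.cargo/bin,
# which is normally on $PATH after installing the Rust tools.
cargo install --path .
```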

Run the program with the `--help` flag to see all supported arguments.

```
$ cargo run -- --help
A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.

Usage: om-wikiparser <COMMAND>

Commands:
  get-tags      Extract wikidata/wikipedia tags from an OpenStreetMap PBF dump
  check-tags    Attempt to parse extracted OSM tags and write errors to stdout in TSV format
  get-articles  Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps
  simplify      Apply html simplification to a single article
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```

> [!NOTE]
> Each subcommand has additional help.
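
For example, the full options for `get-articles` can be shown with:

```sh
cargo run -r -- get-articles --help
```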

The main work is done in the `get-articles` subcommand.
It takes as inputs:

- A [Wikipedia Enterprise JSON dump](#downloading-wikipedia-dumps), decompressed and connected to `stdin`.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters for the articles:
  - Use `--osm-tags` if you have an [OSM .pbf file](#downloading-openstreetmap-osm-files) and can use the `get-tags` subcommand or the `osmconvert` tool.
  - Use `--wikidata-qids` or `--wikipedia-urls` if you have a group of urls or QIDs from another source; a short sketch follows this list.
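
For example, a minimal sketch of filtering by hand-written QID and URL lists (the file names and QIDs here are arbitrary, and `$dump`/`$OUTPUT_DIR` are placeholders as in the examples below):

```sh
# Hypothetical filter files: one QID or article url per line.
printf 'Q90\nQ64\n' > qids.txt
printf 'https://en.wikipedia.org/wiki/Paris\n' > urls.txt

# Stream the decompressed dump into get-articles with both filters.
tar xzOf $dump | cargo run -r -- get-articles \
    --wikidata-qids qids.txt \
    --wikipedia-urls urls.txt \
    $OUTPUT_DIR
```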

To test a single language in a specific map region, first get the matching tags for the region with `get-tags`:
```sh
cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
```

Then write the articles to a directory with `get-articles`:
```sh
tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
```

## Downloading OpenStreetMap (OSM) files

To extract Wikipedia tags with the `get-tags` subcommand, you need a file in the [OSM `.pbf` format](https://wiki.openstreetmap.org/wiki/PBF_Format).

The "planet" file is [available directly from OSM](https://wiki.openstreetmap.org/wiki/Planet.osm) but is ~80GB in size; for testing you can [try a smaller region's data (called "Extracts") from one of the many providers](https://wiki.openstreetmap.org/wiki/Planet.osm#Extracts).
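
For example, Geofabrik publishes such extracts; a small region keeps test runs fast (the URL below is an assumption based on their public download layout):

```sh
# Download a small extract (Monaco is only a few MB) and pull its tags.
wget https://download.geofabrik.de/europe/monaco-latest.osm.pbf
cargo run -r -- get-tags monaco-latest.osm.pbf > region-tags.tsv
```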

## Downloading Wikipedia Dumps

[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).

> [!WARNING]
> Each language's dump is tens of gigabytes in size, and much larger when decompressed.
> To avoid storing the decompressed data, pipe it directly into the wikiparser as described in [Usage](#usage).
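
To get a feel for the dump format without storing the decompressed data, you can peek at a single record from the stream; this sketch assumes `jq` is installed and that `$dump` points at a downloaded `*-ENTERPRISE-HTML.json.tar.gz` file:

```sh
# The dump is newline-delimited JSON: print the field names of the
# first article record, then stop reading the stream.
tar xzOf $dump | head -n 1 | jq 'keys'
```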

To test a small number of articles, you can also use the [On-Demand API](https://enterprise.wikimedia.com/docs/on-demand/) to download them, which has a free tier.

Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
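
If you fetch dumps by hand instead of with `download.sh`, something like the following stays within that limit; the run date and language list are assumptions, so check the [runs listing](https://dumps.wikimedia.org/other/enterprise_html/runs/) for current values:

```sh
# Download the English and German dumps with at most 2 parallel connections.
DATE=20240201  # assumed run date (YYYYMMDD); pick one from the runs listing
printf '%s\n' en de |
  xargs -P 2 -I{} wget -c \
    "https://dumps.wikimedia.org/other/enterprise_html/runs/${DATE}/{}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz"
```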

@@ -46,93 +119,12 @@ It maintains a directory with the following layout:

...
```

## Usage with Maps Generator

To use with the [maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md), see the [`run.sh` script](run.sh) and its own help documentation.
It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.

Each command has its own additional help:

```
$ cargo run -- get-articles --help
Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps.

Expects an uncompressed dump (newline-delimited JSON) connected to stdin.

Usage: om-wikiparser get-articles [OPTIONS] <OUTPUT_DIR>

Arguments:
  <OUTPUT_DIR>
          Directory to write the extracted articles to

Options:
      --write-new-qids <FILE>
          Append to the provided file path the QIDs of articles matched by title but not QID.

          Use this to save the QIDs of articles you know the url of, but not the QID. The same path can later be passed to the `--wikidata-qids` option to extract them from another language's dump. Writes are atomically appended to the file, so the same path may be used by multiple concurrent instances.

      --no-simplify
          Don't process extracted HTML; write the original text to disk

  -h, --help
          Print help (see a summary with '-h')

FILTERS:
      --osm-tags <FILE.tsv>
          Path to a TSV file that contains one or more of `wikidata`, `wikipedia` columns.

          This can be generated with the `get-tags` command or `osmconvert --csv-headline --csv 'wikidata wikipedia'`.

      --wikidata-qids <FILE>
          Path to file that contains a Wikidata QID to extract on each line (e.g. `Q12345`)

      --wikipedia-urls <FILE>
          Path to file that contains a Wikipedia article url to extract on each line (e.g. `https://lang.wikipedia.org/wiki/Article_Title`)
```
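
As noted in the `FILTERS` help above, the `--osm-tags` TSV can also come straight from `osmconvert`; a sketch (osmconvert writes to stdout and uses a tab separator by default, but double-check the option syntax against your osmconvert version):

```sh
# Produce a wikidata/wikipedia TSV from a .pbf extract with osmconvert.
osmconvert $REGION_EXTRACT.pbf --csv-headline --csv='wikidata wikipedia' > osm_tags.tsv
```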

It takes as inputs:

- A Wikipedia Enterprise JSON dump, decompressed and connected to `stdin`.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters passed:
  - A TSV file of wikidata qids and wikipedia urls, created by the `get-tags` command or `osmconvert`, passed as the CLI flag `--osm-tags`.
  - A file of Wikidata QIDs to extract, one per line (e.g. `Q12345`), passed as the CLI flag `--wikidata-qids`.
  - A file of Wikipedia article titles to extract, one per line (e.g. `https://$LANG.wikipedia.org/wiki/$ARTICLE_TITLE`), passed as the CLI flag `--wikipedia-urls`.

As an example of manual usage with the maps generator:
- Assuming this program is installed to `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
  Set `DUMP_DOWNLOAD_DIR` to the location they are downloaded.

@@ -150,8 +142,8 @@ export RUST_LOG=om_wikiparser=debug
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids wikidata_qids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    --write-new-qids new_qids.txt \
    descriptions/

@@ -159,8 +151,8 @@ done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done
```

@@ -172,7 +164,7 @@ om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --osm-tags osm_tags.tsv \
    --write-new-qids new_qids.txt \
    descriptions/

@@ -180,8 +172,8 @@ done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf $dump | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done
```