From 98d5a8a95fc8bb261e9e6af38835a2ad2075c011 Mon Sep 17 00:00:00 2001
From: Evan Lloyd New-Schmidt
Date: Wed, 16 Aug 2023 17:55:42 -0400
Subject: [PATCH] Mention download.sh in README

Signed-off-by: Evan Lloyd New-Schmidt
---
 README.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/README.md b/README.md
index 7f2bd0d..4b4f973 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,32 @@ OpenStreetMap commonly stores these as [`wikipedia*=`](https://wiki.openstreetma
 [`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
 It defines article sections that are not important for users and should be removed from the extracted HTML.
 
+## Downloading Dumps
+
+[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).
+
+For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.
+
+They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
+
+The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
+It maintains a directory with the following layout:
+```
+/
+├── latest -> 20230701/
+├── 20230701/
+│   ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ...
+├── 20230620/
+│   ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ...
+...
+```
+
 ## Usage
 
 To use with the map generator, see the [`run.sh` script](run.sh) and its own help documentation.