Download script #22

Merged
newsch merged 28 commits from download into main 2023-09-26 15:45:08 +00:00
newsch commented 2023-07-18 19:28:30 +00:00 (Migrated from github.com)

Closes #12

Remaining work (both completed):

  • Add handling for error conditions
  • Document config and error codes
biodranik (Migrated from github.com) reviewed 2023-07-20 08:16:08 +00:00
biodranik (Migrated from github.com) commented 2023-07-20 06:03:51 +00:00

In case no new dumps are available, it should just make sure that the latest ones are already downloaded and exit gracefully (and print that).

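A minimal sketch of that behavior, with illustrative variable names (a fuller version would also verify file sizes):

```bash
# Illustrative names: LATEST_DATE is the newest run date found on the server,
# DUMP_DIR the local dump directory, and log the script's logging helper.
if [ -d "$DUMP_DIR/$LATEST_DATE" ]; then
    log "Latest dump ($LATEST_DATE) is already downloaded"
    exit 0
fi
```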
biodranik (Migrated from github.com) commented 2023-07-20 06:04:33 +00:00

set -euxo pipefail is helpful if you decide to use pipes in the script.

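For reference, what each flag does (a toy pipeline, not from the script):

```bash
#!/usr/bin/env bash
set -euxo pipefail
# -e           exit as soon as any command fails
# -u           treat references to unset variables as errors
# -x           print each command before running it (useful in cron logs)
# -o pipefail  fail a pipeline if any stage fails, not only the last one
false | tr 'a-z' 'A-Z'  # aborts the script with pipefail + -e; without them it would continue silently
```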
biodranik (Migrated from github.com) commented 2023-07-20 06:05:55 +00:00

nit: fewer lines of code are easier to read.

if [ -z "${LANGUAGES+}" ]; then
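
For context on the idiom: the word after `+` is usually non-empty, e.g. `x`, so the test can tell unset apart from empty. A sketch, not necessarily the merged code:

```bash
# ${LANGUAGES+x} expands to "x" if LANGUAGES is set (even to ""), and to
# nothing if it is unset, so the default is applied only when the variable
# was not provided at all.
if [ -z "${LANGUAGES+x}" ]; then
    LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
```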
biodranik (Migrated from github.com) commented 2023-07-20 06:06:30 +00:00

nit: (here and below)

for lang in $LANGUAGES; do
biodranik (Migrated from github.com) commented 2023-07-20 06:09:11 +00:00

TMPDIR?

biodranik (Migrated from github.com) commented 2023-07-20 08:15:10 +00:00

get_wiki_dump.sh: line 11: 1: unbound variable

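That error comes from referencing $1 under set -u when no argument is passed; a common guard looks like this (a sketch, not necessarily the PR's fix):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Check the argument count before touching $1 so nounset (-u) doesn't abort
# with "1: unbound variable" when the script is run without arguments.
if [ $# -lt 1 ]; then
    echo "Usage: $0 <DUMP_DIR>" >&2
    exit 1
fi
DUMP_DIR=$1
```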
@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
biodranik (Migrated from github.com) commented 2023-07-20 06:07:06 +00:00

"Latest dumps are already downloaded"?

"Latest dumps are already downloaded"?
newsch (Migrated from github.com) reviewed 2023-08-16 21:17:25 +00:00
@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
newsch (Migrated from github.com) commented 2023-08-16 21:17:25 +00:00

If URLS is empty, then none of the specified languages could be found for the latest dump.

If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.

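In other words, the empty-URLS branch only covers the missing-languages case. Roughly (message and exit code are illustrative):

```bash
if [ -z "$URLS" ]; then
    # The latest dump exists but contains none of the requested languages.
    # "Nothing new to download" is handled separately: the sizes of the
    # previously downloaded files are re-checked and the script exits 0.
    log "None of the requested languages are available in the latest dump"
    exit 1
fi
```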
biodranik (Migrated from github.com) reviewed 2023-08-16 22:11:40 +00:00
@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
biodranik (Migrated from github.com) commented 2023-08-16 22:11:40 +00:00

Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).

Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?

biodranik (Migrated from github.com) reviewed 2023-08-16 22:19:52 +00:00
biodranik (Migrated from github.com) left a comment

Need to test it on a server )

biodranik (Migrated from github.com) commented 2023-08-16 22:18:52 +00:00

Do you really need to store runs.html on disk and then clean it up?

@ -0,0 +118,4 @@
LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
biodranik (Migrated from github.com) commented 2023-08-16 22:17:42 +00:00

nit: Can array be used here without a warning?

newsch (Migrated from github.com) reviewed 2023-08-16 23:38:52 +00:00
newsch (Migrated from github.com) commented 2023-08-16 23:38:52 +00:00

Good point, I had it like that for POSIX sh because there's no pipefail. With bash it shouldn't be a problem.

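With bash's pipefail, the runs index can be streamed through a pipeline instead of being saved to runs.html first. A sketch, assuming the enterprise_html runs index URL used elsewhere in this thread and a simplified date extraction:

```bash
set -euo pipefail
# Stream the dump index and pull out the run dates without a temporary file;
# if the download fails, pipefail fails the whole pipeline (and -e stops the script).
RUN_DATES=$(wget -qO- https://dumps.wikimedia.org/other/enterprise_html/runs/ \
    | grep -oE '[0-9]{8}' | sort -u)
```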
newsch (Migrated from github.com) reviewed 2023-08-17 00:05:40 +00:00
@ -0,0 +118,4 @@
LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
newsch (Migrated from github.com) commented 2023-08-17 00:05:40 +00:00

To convert it to an array with the same semantics it would need to suppress another warning:

# shellcheck disable=SC2206 # Intentionally split on whitespace.
LANGUAGES=( $LANGUAGES )
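
For what it's worth, reading the jq output straight into an array avoids both warnings (a sketch, assuming bash 4+ for mapfile; it wouldn't cover LANGUAGES being passed in as a space-separated environment variable):

```bash
mapfile -t LANGUAGES < <(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
log "Selected languages:" "${LANGUAGES[@]}"
for lang in "${LANGUAGES[@]}"; do
    echo "$lang"
done
```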
newsch commented 2023-08-17 14:48:31 +00:00 (Migrated from github.com)

I've looked a little into parallel downloads with programs in the Debian repos:

GNU parallel or GNU xargs works, but you lose the progress bar from wget and get no indication of how the downloads are doing:

for url in $URLS; do echo "$url"; done | xargs -L 1 -P 3 wget --no-verbose --continue --directory-prefix "$DOWNLOAD_DIR"

aria2c returned protocol errors:

aria2c -x2 -s2 -c -d ~/Downloads/aria-test \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz

08/17 10:35:22 [NOTICE] Downloading 1 item(s)

08/17 10:35:23 [ERROR] CUID#9 - Download aborted. URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
Exception: [AbstractCommand.cc:351] errorCode=8 URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
  -> [HttpResponse.cc:81] errorCode=8 Invalid range header. Request: 130154496-14855176191/29631479446, Response: 130154496-14855176191/25080519092

axel only seems to parallelize a single download, not multiple files.

wget2 works great out of the box:

wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS
newsch (Migrated from github.com) reviewed 2023-08-17 14:58:48 +00:00
@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
newsch (Migrated from github.com) commented 2023-08-17 14:58:47 +00:00

They shouldn't need to be.

The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.

If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.

But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.

newsch (Migrated from github.com) reviewed 2023-08-17 15:02:10 +00:00
newsch (Migrated from github.com) commented 2023-08-17 15:02:10 +00:00

Do you want the script to handle this?

If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise the script could delete the last dump while wikiparser is still using it.

biodranik (Migrated from github.com) reviewed 2023-08-17 23:46:19 +00:00
biodranik (Migrated from github.com) commented 2023-08-17 23:46:19 +00:00
  1. Aren't files that were open before their deletion on Linux still accessible?
  2. Dumps are produced regularly, right? We can set a specific schedule.
  3. Script may have an option to automatically delete older dumps.
biodranik commented 2023-08-17 23:48:45 +00:00 (Migrated from github.com)

> wget2 works great out of the box:

The default behavior can be like this: use wget2 if it's available, and fall back to a single-threaded download, mentioning that wget2 would speed things up.

Another important question is whether it's ok to hit the wiki servers with parallel downloads, or whether that would overload them. Can you please ask them to confirm? Maybe they have a single-threaded policy?

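A rough sketch of that fallback (layout is illustrative, not the script's final form):

```bash
if command -v wget2 > /dev/null; then
    # shellcheck disable=SC2086 # URLS is intentionally split into separate arguments.
    wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS
else
    log "Install wget2 for faster parallel downloads"
    for url in $URLS; do
        wget --continue --directory-prefix "$DOWNLOAD_DIR" "$url"
    done
fi
```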
newsch (Migrated from github.com) reviewed 2023-08-18 17:06:30 +00:00
newsch (Migrated from github.com) commented 2023-08-18 17:06:30 +00:00
> 1. Aren't files that were open before their deletion on Linux still accessible?

You're right, as long as run.sh is started before download.sh deletes them, it will be able to access the files.

> 2. Dumps are produced regularly, right? We can set a specific schedule.

Yes, they're started on the 1st and the 20th of each month, and it looks like they finish within 3 days.

> 3. Script may have an option to automatically delete older dumps.

👍

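Given that schedule, a cron entry a few days after each run starts might look like this (the day-of-month offsets and paths are assumptions):

```
# m h dom  mon dow  command  (paths are placeholders)
0 3 5,24 * * /path/to/download.sh -D /srv/wikipedia_dumps
```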
newsch commented 2023-08-18 17:18:41 +00:00 (Migrated from github.com)

Looks like 2 parallel downloads is the max:

> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2. This will help to ensure that everyone can access the files with reasonable download times. Clients that try to evade these limits may be blocked.

There are at least two mirrors (https://dumps.wikimedia.org/mirrors.html) that host some of the latest enterprise dumps:

  • (US) https://dumps.wikimedia.your.org/other/enterprise_html/runs/
  • (Sweden) https://mirror.accum.se/mirror/wikimedia.org/other/enterprise_html/runs/
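
If wget2 is used, its connection count should respect that cap; assuming wget2's --max-threads option is the right knob, something like:

```bash
# Cap wget2 at 2 parallel connections to respect the per-IP limit
# (assumes --max-threads is wget2's option for this).
wget2 --max-threads=2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS
```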
newsch (Migrated from github.com) reviewed 2023-08-18 18:28:29 +00:00
newsch (Migrated from github.com) commented 2023-08-18 18:28:28 +00:00

I've added a new option:

-D      Delete all old dump subdirectories if the latest is downloaded
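
A sketch of what that cleanup could look like, given the date-subdirectory layout and the 'latest' link described in the usage text (the script's actual logic may differ):

```bash
# Remove every dump subdirectory except the one 'latest' points to.
LATEST=$(readlink -f "$DUMP_DIR/latest")
for dir in "$DUMP_DIR"/*/; do
    [ -d "$dir" ] || continue  # skip if the glob matched nothing
    dir=$(readlink -f "$dir")
    if [ "$dir" != "$LATEST" ]; then
        log "Deleting old dump $dir"
        rm -r "$dir"
    fi
done
```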
biodranik commented 2023-08-18 18:36:00 +00:00 (Migrated from github.com)

Good, let's track how fast the mirrors are updated. We may hardcode the mirror URLs or put links to them in the readme, and use whichever is better/faster.

newsch commented 2023-08-21 20:32:01 +00:00 (Migrated from github.com)

Both of the mirrors have the 2023-08-20 dumps up already.

biodranik (Migrated from github.com) reviewed 2023-08-21 20:55:57 +00:00
biodranik (Migrated from github.com) left a comment

Thanks!

  1. wget2 doesn't resume interrupted downloads.
  2. Don't forget to squash all commits before the merge )
biodranik (Migrated from github.com) commented 2023-08-21 20:53:18 +00:00

-c 1, -c 2 and no option behave in the same way with wget2 installed.

@ -0,0 +5,4 @@
Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
biodranik (Migrated from github.com) commented 2023-08-21 20:54:51 +00:00

Will wikiparser generator properly find/load newer versions from the latest dir without specifying explicit file names?

newsch (Migrated from github.com) reviewed 2023-08-21 21:08:15 +00:00
@ -0,0 +5,4 @@
Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
newsch (Migrated from github.com) commented 2023-08-21 21:08:15 +00:00

For the run.sh script, you'll provide a glob of the latest directory:

./run.sh descriptions/ planet.osm.pbf $DUMP_DIR/latest/*

It doesn't have any special handling for the $DUMP_DIR layout.

newsch (Migrated from github.com) reviewed 2023-08-21 21:09:42 +00:00
newsch (Migrated from github.com) commented 2023-08-21 21:09:41 +00:00

Correct, I'll clarify that.

newsch commented 2023-08-21 21:14:38 +00:00 (Migrated from github.com)

> wget2 doesn't resume interrupted downloads

What kind of interruption? It should be able to handle network drops and temporary errors.

biodranik (Migrated from github.com) approved these changes 2023-08-29 21:54:51 +00:00
@ -0,0 +77,4 @@
echo "$USAGE" | head -n1 >&2
exit 1
fi
biodranik (Migrated from github.com) commented 2023-08-29 21:54:31 +00:00

Can spaces be added here?

biodranik commented 2023-08-29 21:55:56 +00:00 (Migrated from github.com)

It would be great to test all these PRs on the server with real data.

newsch (Migrated from github.com) reviewed 2023-09-01 16:02:39 +00:00
@ -0,0 +77,4 @@
echo "$USAGE" | head -n1 >&2
exit 1
fi
newsch (Migrated from github.com) commented 2023-09-01 16:02:39 +00:00

I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.

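For illustration, a descriptive CamelCase agent string could be passed like this (name and contact URL are placeholders):

```bash
# Name and contact URL below are placeholders, not the script's actual values.
wget --user-agent="OrganicMapsWikiparserDownloader/1.0 (+https://example.org/contact)" \
    --continue --directory-prefix "$DOWNLOAD_DIR" "$url"
```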