Download script #22
Closes #12
Remaining work:
In case no new dumps are available, it should just make sure that the latest ones are already downloaded and exit gracefully (and print that).
set -euxo pipefail is helpful if you decide to use pipes in the script.
nit: fewer lines of code are easier to read.
nit: (here and below)
TMPDIR?
get_wiki_dump.sh: line 11: 1: unbound variable
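That error is presumably set -u aborting on a missing positional parameter. A minimal sketch of one way to guard against it (DUMP_DIR is only a placeholder name for the script's first argument, not necessarily what the script uses):

#!/usr/bin/env bash
set -euo pipefail

# Under `set -u`, referencing $1 with no argument given aborts with
# "1: unbound variable" before any usage message can be printed.
# Defaulting the parameter keeps the usage check reachable.
DUMP_DIR="${1:-}"
if [ -z "$DUMP_DIR" ]; then
  echo "Usage: $0 <DUMP_DIR>" >&2
  exit 1
fi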
@@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
"Latest dumps are already downloaded"?
@@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
If URLS is empty, then none of the specified languages could be found for the latest dump.
If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.
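A minimal sketch of the exit path being discussed, reusing the log helper and the URLS variable from the hunk above (the message wording and exact flow here are only a suggestion, not the script's actual code):

if [ -z "$URLS" ]; then
  # Either no newer dump exists yet, or none of the selected languages
  # were found in it; either way this is not an error for a cron job.
  log "No new dumps available; latest dumps are already downloaded"
  exit 0
fi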
@@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).
Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?
Need to test it on a server :)
Do you really need to store runs.html on disk and then clean it up?
@@ -0,0 +118,4 @@
LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
nit: Can array be used here without a warning?
Good point, I had it like that for POSIX sh because there's no pipefail. With bash it shouldn't be a problem.
@@ -0,0 +118,4 @@
LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
To convert it to an array with the same semantics it would need to suppress another warning:
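As an illustration (not the exact snippet from the review): word-splitting the jq output into an array trips shellcheck SC2207, while mapfile sidesteps the warning and keeps the same one-language-per-element semantics:

# Splitting command output into an array triggers shellcheck SC2207:
#   LANGUAGES=( $(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json) )
# mapfile avoids the warning and stores one language per array element.
mapfile -t LANGUAGES < <(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
log "Selected languages:" "${LANGUAGES[@]}"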
I've looked a little into parallel downloads with programs in the Debian repos:
- GNU parallel or GNU xargs work, but you lose the progress bar from wget and have no indication of how the downloads are doing.
- aria2c returned protocol errors.
- axel only seems to parallelize a single download.
- wget2 works great out of the box.
@@ -0,0 +141,4 @@
done
if [ -z "$URLS" ]; then
log "No dumps available"
They shouldn't need to be.
The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.
If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.
But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.
Do you want the script to handle this?
If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise the script could delete the last dump while wikiparser is still using it?
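If keeping two copies is the route taken, one possible cleanup step could look like the sketch below; the directory layout and the number of copies kept are assumptions from this discussion, not the script's actual behavior:

# Delete all but the two newest dated subdirectories of $DUMP_DIR.
# Assumes the date-named subdirectories sort chronologically (e.g. YYYYMMDD)
# and that 'latest' is a symlink, which `-type d` does not match.
find "$DUMP_DIR" -mindepth 1 -maxdepth 1 -type d | sort -r | tail -n +3 \
  | while read -r old_dump_dir; do
      rm -rf "$old_dump_dir"
    done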
The default behavior can be like this: use wget2 if it's available, and fall back to a single-threaded download while mentioning a speedup with wget2.
Another important question is whether it's OK to overload wiki servers with parallel downloads. Can you please ask them to confirm? Maybe they have a single-threaded policy?
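A sketch of that fallback, assuming wget2 accepts the same basic flags as wget (the exact flags and log wording here are illustrative, not taken from the script):

# Prefer wget2 for parallel downloads; otherwise fall back to single-threaded
# wget and point out the possible speedup.
if command -v wget2 > /dev/null; then
  DOWNLOADER=wget2
else
  DOWNLOADER=wget
  log "wget2 is not installed; falling back to single-threaded wget (install wget2 for parallel downloads)"
fi
# shellcheck disable=SC2086 # URLS is intentionally split into one argument per URL.
"$DOWNLOADER" --continue --directory-prefix "$DUMP_DIR" $URLS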
You're right, as long as run.sh is started before download.sh deletes them, it will be able to access the files.
Yes, they're started on the 1st and the 20th of each month, and it looks like they're finished within 3 days.
👍
Looks like 2 parallel downloads is the max:
There are at least two mirrors that host some of the latest enterprise dumps:
I've added a new option:
Good, let's track how fast the mirrors are updated. We may hardcode the URLs/mirrors or put links to them into the readme, and use whichever is better/faster.
Both of the mirrors have the 2023-08-20 dumps up already.
Thanks!
-c 1, -c 2, and no option all behave in the same way with wget2 installed.
@@ -0,0 +5,4 @@
Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
Will wikiparser generator properly find/load newer versions from the latest dir without specifying explicit file names?
@@ -0,0 +5,4 @@
Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
For the run.sh script, you'll provide a glob of the latest directory. It doesn't have any special handling for the $DUMP_DIR layout.
Correct, I'll clarify that.
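For illustration, such an invocation might look like the line below; the argument order, the build/planet file names, and the dump file naming pattern are all assumptions here, not taken from run.sh:

# Hypothetical example: hand every dump under the 'latest' symlink to run.sh
# via a shell glob. Everything except $DUMP_DIR/latest is a placeholder.
./run.sh build/ planet-latest.osm.pbf "$DUMP_DIR"/latest/*-ENTERPRISE-HTML.json.tar.gz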
What kind of interruption? It should be able to handle network drops and temporary errors.
@@ -0,0 +77,4 @@
echo "$USAGE" | head -n1 >&2
exit 1
fi
Can spaces be added here?
It would be great to test all these PRs on the server with real data.
@@ -0,0 +77,4 @@
echo "$USAGE" | head -n1 >&2
exit 1
fi
I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.
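For illustration only, that CamelCase style with wget's --user-agent option; the name, version, and URL below are placeholders, not the project's actual agent string:

# No spaces inside the identifier itself; contact details go in the parentheses.
# $URL is a placeholder for one of the dump URLs.
USER_AGENT="OrganicMapsWikiparser/0.0 (https://github.com/organicmaps/wikiparser)"
wget --user-agent="$USER_AGENT" --directory-prefix "$DUMP_DIR" "$URL"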