Add script for running with map generator #21
Reference: organicmaps/wikiparser#21
Closes #17.
Remaining work:
@@ -0,0 +1,32 @@
use std::process::Command;
/// Pass git-describe through CARGO_GIT_VERSION env variable
Why is it needed?
Checking pipe failures helps.
If `-x` echoing doesn't hurt, then it can always be enabled, for a better understanding of what magic is going on under the hood.
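A minimal sketch of what `pipefail` changes (bash-specific, illustrative only — not the PR's script):

```shell
#!/usr/bin/env bash
# Without pipefail a pipeline's status is that of its LAST command,
# so 'false | true' would succeed and hide the failure of 'false'.
set -o pipefail
if false | true; then
    echo "pipeline succeeded"
else
    echo "pipeline failed"   # reached: pipefail surfaces the failed stage
fi
```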
Is this copy-pasted between scripts? Could it be shared and reused? :)
Can echo be used here?
Why is it better than echo?
P.S. This and the previous comment are also related to another script.
nit: here and in other places
Why is the colon needed? What is the purpose of this line?
ditto: why is printf better than echo?
It would be great to clarify why the latest map build is needed at all.
Am I correctly understanding the issue with the current approach?
`pipefail` is unique to bash; I wrote this as a POSIX sh script. Happy to switch if you'd rather use bash.

Yes, should I put this in a third file and `source` it?

I used `printf` because `echo` doesn't handle escaped characters like `\t` and `\n` in a portable way, and you can format numbers and other things nicely.

It sets `MAPS_BUILD_ROOT` to `~/maps_build` if it isn't set already; the colon is a shell builtin no-op, so the expansion is evaluated but its result is discarded. It could be replaced with:
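For illustration, a minimal sketch of the two equivalent default-value idioms under discussion (the variable and path are taken from the thread; the surrounding script is assumed):

```shell
#!/bin/sh
# Idiom 1: ':' is the no-op builtin; the ':=' expansion assigns the
# default to MAPS_BUILD_ROOT as a side effect if it is unset or empty.
: "${MAPS_BUILD_ROOT:=$HOME/maps_build}"

# Idiom 2: ':-' only expands to the default without assigning, so the
# result must be assigned back explicitly. Both forms leave an
# existing value untouched.
MAPS_BUILD_ROOT="${MAPS_BUILD_ROOT:-$HOME/maps_build}"

echo "$MAPS_BUILD_ROOT"
```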
To replace the workflow with the scraper:
So if you'd like to do it in one run, we could update the generator to call wikiparser.
Or we could tweak the generator to only output the descriptions and continue so you can run wikiparser out-of-band.
Bash is the default shell used in many companies. It is a good practice to write bash scripts and use bash features.
#!/usr/bin/env bash
Yes, there is nothing wrong with that approach to avoid copy-paste. Imagine you introduce a third script, or split your current one into parts.
Will `FOO="${FOO:-default value}"` work here?

Updating the generator to call wikiparser means that the generator should wait until wikiparser finishes, right? This approach may be a good temporary start, but we aim to speed up the map generation process as much as possible. That's why the ideal solution would (likely) be to start the generator and wikiparser in parallel, as soon as a new OSM planet dump is available (or maybe start wikiparser before the generator). So when the generator needs descriptions, they will already be available.
That's why it was important to focus on speedy articles extraction/processing from the start.
WDYT?
@@ -0,0 +1,32 @@
use std::process::Command;
/// Pass git-describe through CARGO_GIT_VERSION env variable
I've found it useful in the past to embed the git commit in the binary so that when I'm looking through logs I can tell what version was running.
I can remove it if you don't think it's useful.
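A hedged sketch of how such a `build.rs` typically looks (the exact `git describe` flags and the fallback string are assumptions, not the PR's code):

```rust
// build.rs — embed `git describe` output at compile time.
use std::process::Command;

fn git_version() -> String {
    Command::new("git")
        .args(["describe", "--always", "--dirty"])
        .output()
        .ok()
        .filter(|out| out.status.success())
        .map(|out| String::from_utf8_lossy(&out.stdout).trim().to_string())
        .unwrap_or_else(|| String::from("unknown"))
}

fn main() {
    // The crate can then read the value with env!("CARGO_GIT_VERSION").
    println!("cargo:rustc-env=CARGO_GIT_VERSION={}", git_version());
    // Re-run the build script when HEAD moves so the version stays fresh.
    println!("cargo:rerun-if-changed=.git/HEAD");
}
```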
Yes, `:-` should work.

Yes, it replaces the blocking scraper script in its current form. I thought it would be better to start with this working and then separate and speed up the process. To separate fully from the generator we need to finish #19. In the meantime we could also use the outputs from an old map build and run the wikiparser ahead of time / in parallel.

In other places I've heard `printf` recommended over `echo`, but if we're using bash explicitly then the portability concerns are not relevant.
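A small illustration of the portability point (POSIX sh; the sample format strings are my own, not from the script):

```shell
#!/bin/sh
# printf interprets backslash escapes in its format string on every
# POSIX shell; echo's handling of '\t'/'\n' (and of -e) varies
# between shells and implementations.
printf 'name\tcount\n'
# printf also supports C-style formatting, which echo cannot do:
printf '%-8s %6.2f\n' pi 3.14159   # left-padded name, rounded float
```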
@biodranik do you want to use GNU `parallel` or something else here instead of a serial for loop?

Maybe the ideal workflow would be to start wikiparser in parallel with the map generation and make the generator aware of when wikiparser finishes (assuming that it finishes faster than the generator requires wiki articles). If wikiparser takes a lot of time, then it's better to run it out of band in advance, considering that its data is rarely updated anyway.

Won't using `&` and `wait` at the end be enough? Should I try these changes on a server first?
As long as the machine it's running on has enough cores it's fine for the maps server, but it will bog down on my laptop. I'll use `&` for now.

You may add an option to use only one core.
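A minimal sketch of the `&` + `wait` pattern under discussion (`process_dump` and the dump names are hypothetical placeholders for the real per-dump work):

```shell
#!/bin/sh
# Hypothetical per-dump work; stands in for the real wikiparser call.
process_dump() {
    echo "processing $1"
}

# Launch one background job per dump, then block until all finish.
for dump in enwiki dewiki frwiki; do
    process_dump "$dump" &
done
wait   # returns only after every background child has exited
echo "all dumps processed"
```

Dropping the `&` recovers a serial one-core run, which matches the option suggested above.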