Split downloads across mirrors #27

Open
opened 2023-08-22 14:28:08 +00:00 by newsch · 2 comments
newsch commented 2023-08-22 14:28:08 +00:00 (Migrated from github.com)

As discussed in #22, Wikipedia has a limit of 2 concurrent connections and seems to rate-limit each to about 4 MB/s. There are at least two mirrors of the Enterprise dumps.
For the fastest speeds, ideally we could split downloads between Wikipedia and the mirrors, or even download different parts of the same file concurrently, as `aria2c` does.

Unfortunately, none of the parallel downloaders I've seen allow setting connection limits per host (e.g. 2 for dumps.wikimedia.org, 4 for the rest).

So besides writing our own downloader, to respect the Wikimedia limits we could:

  • Keep the 2-thread limit and divide the files across the available hosts
  • Raise the thread limit, but assign only two files to dumps.wikimedia.org
  • Raise the thread limit and don't use dumps.wikimedia.org for any files
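For reference, the missing feature in off-the-shelf parallel downloaders is a per-host connection cap. A minimal sketch of that idea (not code from this repo; the mirror default of 4 connections is an assumption) using Python threading semaphores:

```python
import threading
from urllib.parse import urlparse

# Per-host connection limits: 2 for dumps.wikimedia.org per Wikimedia's
# policy; the default for mirrors is an assumed, more permissive value.
HOST_LIMITS = {"dumps.wikimedia.org": 2}
DEFAULT_LIMIT = 4

_semaphores: dict[str, threading.Semaphore] = {}
_lock = threading.Lock()

def host_semaphore(url: str) -> threading.Semaphore:
    """Return the shared semaphore that caps connections to url's host."""
    host = urlparse(url).netloc
    with _lock:
        if host not in _semaphores:
            limit = HOST_LIMITS.get(host, DEFAULT_LIMIT)
            _semaphores[host] = threading.Semaphore(limit)
        return _semaphores[host]

def download(url: str) -> None:
    # Acquire a slot for this host before opening a connection, so no
    # more than HOST_LIMITS[host] transfers run against it at once.
    with host_semaphore(url):
        ...  # perform the actual fetch here
```

Worker threads would call `download()` directly; the semaphore, not the global thread count, enforces each host's limit.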
biodranik commented 2023-09-01 06:09:19 +00:00 (Migrated from github.com)

What is the simplest solution?

newsch commented 2023-09-01 17:02:37 +00:00 (Migrated from github.com)

The simplest is to only use a single host.
Beyond that, I think the second option would provide the best throughput increase and still be relatively straightforward.

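The second option (cap dumps.wikimedia.org at two files, send the rest elsewhere) could be sketched as a simple assignment step before downloading; the mirror hostname below is a placeholder, not a real mirror:

```python
# Placeholder mirror list; the actual Enterprise dump mirrors vary.
MIRRORS = ["mirror.example.org"]

def assign_hosts(files: list[str]) -> dict[str, str]:
    """Map each dump file to a host: the first two files go to
    dumps.wikimedia.org, the rest round-robin across the mirrors."""
    assignment = {}
    for i, name in enumerate(files):
        if i < 2:
            assignment[name] = "dumps.wikimedia.org"
        else:
            assignment[name] = MIRRORS[(i - 2) % len(MIRRORS)]
    return assignment
```

With the assignment fixed up front, the download pool can run more than two threads while dumps.wikimedia.org still sees at most two connections.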
Reference: organicmaps/wikiparser#27