Skip articles that haven't changed between dumps #9

Open
opened 2023-06-26 16:13:39 +00:00 by newsch · 2 comments
newsch commented 2023-06-26 16:13:39 +00:00 (Migrated from github.com)

The [dump schema](https://enterprise.wikimedia.com/docs/data-dictionary/) includes a `date_modified` timestamp and other revision metadata.

To reduce disk I/O, we could store some of that metadata alongside the articles, compare it against the incoming dump during processing, and skip articles that haven't changed.

One way to do this would be to store the `date_modified` timestamp as the `modified` attribute of the article file.
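
As a rough illustration of the idea, here is a minimal Python sketch of storing `date_modified` as the file's modification time and skipping the write when it already matches. This is not the project's code: the helper names (`parse_date_modified`, `is_unchanged`, `write_article`) are hypothetical, and it assumes `date_modified` arrives as an ISO 8601 string and that the filesystem keeps modification times with at least one-second resolution.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def parse_date_modified(value: str) -> float:
    """Convert an ISO 8601 timestamp (assumed format) to Unix seconds."""
    # Older Python versions reject a trailing 'Z', so normalize it first.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def is_unchanged(path: Path, date_modified: str) -> bool:
    """Return True if the file exists and its mtime matches date_modified."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False
    # Compare whole seconds only, since sub-second precision varies by filesystem.
    return int(mtime) == int(parse_date_modified(date_modified))


def write_article(path: Path, text: str, date_modified: str) -> bool:
    """Write the article unless it is unchanged; return True if it was written."""
    if is_unchanged(path, date_modified):
        return False
    path.write_text(text, encoding="utf-8")
    ts = parse_date_modified(date_modified)
    # Record the dump's revision time as both access and modification time.
    os.utime(path, (ts, ts))
    return True
```

With this approach the check costs one `stat` call per article instead of a read or rewrite. Whether that saves enough I/O to matter would still need to be measured.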
biodranik commented 2023-06-26 16:58:10 +00:00 (Migrated from github.com)

An interesting optimization, but it may not be worth it. We need to prove its benefits first. Let's leave it at very low priority for now.
newsch commented 2023-06-26 19:09:24 +00:00 (Migrated from github.com)

Understood. I've been thinking about it since you [mentioned it here](https://github.com/organicmaps/organicmaps/issues/3478#issuecomment-1374018687). We'll see what the profiling shows for the workflow.
Reference: organicmaps/wikiparser#9