Generate Wikipedia articles from offline Wikipedia dump #3478

Closed
opened 2022-09-24 18:05:54 +00:00 by biodranik · 5 comments
biodranik commented 2022-09-24 18:05:54 +00:00 (Migrated from github.com)

There are several issues now with our current crawler implementation:

  1. It is banned by Wiki servers because it DDoSes the API.
  2. It takes a lot of time to get all summaries for several supported languages.
  3. If we add more languages, the time increases even more.
  4. As a result, we are still using old, outdated articles that we got this spring.

A way better option is to download a dump of all Wiki articles for given languages and extract summaries from there directly.

Any volunteers for this task?

A list of supported languages and the output format can be checked in the existing implementation in `tools/python/descriptions`.

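A minimal sketch of what "download a dump for given languages" could look like; the dump URL pattern and the language list below are assumptions (the real list lives in `tools/python/descriptions`):

```python
# Sketch only: fetch the latest pages-articles dump for each supported language.
# The URL pattern and SUPPORTED_LANGS are assumptions, not the final implementation.
import urllib.request
from pathlib import Path

SUPPORTED_LANGS = ["en", "de", "es", "fr", "ru"]  # placeholder list
DUMP_URL = "https://dumps.wikimedia.org/{lang}wiki/latest/{lang}wiki-latest-pages-articles.xml.bz2"

def download_dumps(out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for lang in SUPPORTED_LANGS:
        url = DUMP_URL.format(lang=lang)
        target = out_dir / url.rsplit("/", 1)[-1]
        print(f"Downloading {url} -> {target}")
        urllib.request.urlretrieve(url, target)

if __name__ == "__main__":
    download_dumps(Path("dumps"))
```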
az09 commented 2023-01-06 18:43:18 +00:00 (Migrated from github.com)

Can I help? ^_^

biodranik commented 2023-01-06 19:06:23 +00:00 (Migrated from github.com)

Some ideas about the workflow:

  1. A separate script to update the wiki dumps to the latest version (it looks like each language has to be [downloaded separately](https://dumps.wikimedia.org/backup-index.html)).
  2. A separate script/tool to process the dump and extract articles only in the required/supported languages.
  3. `pbzip2` (parallel bzip2) may be used to decompress/process the dump faster.
  4. A fast streaming XML parser should be used so the bz2 archives don't have to be decompressed unnecessarily (see the sketch after this list).
  5. Optionally, some timestamp-check logic can be implemented to detect whether an article was modified, to avoid unnecessary file operations. Otherwise, the previous dump should be completely removed and regenerated.
  6. (Bonus for the future) Need to check if we can keep HTTP links to uploaded images, so that articles can later optionally load/show images if there is an internet connection and the user has allowed it.
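A minimal sketch of points 3–4, assuming the prototype stays in Python: the archive is streamed (nothing is unpacked to disk) and the XML is walked incrementally. The MediaWiki export namespace version below is an assumption and may differ between dumps.

```python
# Sketch only: stream-parse a pages-articles dump without decompressing it to disk.
import bz2
import xml.etree.ElementTree as ET

# Assumption: the export namespace version (e.g. export-0.11) may differ per dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(dump_path):
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(f"{NS}revision/{NS}text")
                yield title, wikitext
                elem.clear()  # free the processed page's content

if __name__ == "__main__":
    for title, _text in iter_pages("dumps/enwiki-latest-pages-articles.xml.bz2"):
        print(title)
```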
Member

Linking #2410

euf commented 2023-01-10 10:37:58 +00:00 (Migrated from github.com)

Since the current workflow is in Python, a few libraries to consider:

- https://github.com/5j9/wikitextparser
- https://github.com/earwig/mwparserfromhell
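For example, `mwparserfromhell` can strip wikitext down to plain text; a rough sketch of a summary extractor (the lead-section choice and character cutoff are placeholders):

```python
# Sketch only: extract a plain-text summary (lead section) from raw wikitext
# using mwparserfromhell; the max_chars cutoff is an arbitrary placeholder.
import mwparserfromhell

def extract_summary(wikitext: str, max_chars: int = 1000) -> str:
    parsed = mwparserfromhell.parse(wikitext)
    # With include_lead=True, the first returned section is the article's lead.
    lead = parsed.get_sections(include_lead=True)[0]
    return lead.strip_code().strip()[:max_chars]
```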
biodranik commented 2023-01-10 16:47:41 +00:00 (Migrated from github.com)

Python is not the _fastest_ tool to quickly and efficiently process large bz2-ed dumps. There are faster tools/languages.
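If the workflow does stay in Python for now, one partial mitigation is to hand decompression to `pbzip2` (point 3 above) through a pipe and keep only the parsing in Python. A sketch under that assumption, not a claim that it matches a compiled tool:

```python
# Sketch only: let pbzip2 do multi-threaded decompression and stream the XML
# into the same iterparse loop as above (assumes pbzip2 is installed and on PATH).
import subprocess
import xml.etree.ElementTree as ET

def iter_pages_via_pbzip2(dump_path):
    proc = subprocess.Popen(["pbzip2", "-d", "-c", dump_path], stdout=subprocess.PIPE)
    try:
        for _event, elem in ET.iterparse(proc.stdout, events=("end",)):
            if elem.tag.endswith("}page"):
                yield elem
                elem.clear()  # free the processed page's content
    finally:
        proc.stdout.close()
        proc.wait()
```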