Download Wikipedia articles' summaries only #2410

Closed
pastk wants to merge 2 commits from pastk-scripts into master
Member

"Summary" is an article's part before any sections like "History", "See also", etc.
vng (Migrated from github.com) reviewed 2022-04-16 19:42:08 +00:00
pastk reviewed 2022-04-16 19:43:11 +00:00
@@ -140,3 +140,3 @@
 x.extract()
-soup = remove_bad_sections(soup, lang)
+# soup = remove_bad_sections(soup, lang)
 html = str(soup.prettify())
Author
Member

Shall I remove all unused code, or keep it commented out in case we need it again?
biodranik (Migrated from github.com) reviewed 2022-04-16 21:36:47 +00:00
biodranik (Migrated from github.com) left a comment

If we run the script on the current data as it is, how many gigabytes of articles will it download?
biodranik (Migrated from github.com) commented 2022-04-16 20:01:10 +00:00

What is the version in this script used for?
biodranik (Migrated from github.com) commented 2022-04-16 21:33:02 +00:00

Why strip out all the sections? I read the articles, and everything really does seem interesting and useful. Wouldn't it be better to explicitly strip out only what is irrelevant? What problem are we actually solving?
biodranik (Migrated from github.com) commented 2022-04-16 21:36:07 +00:00

Do all articles really have a summary section?
pastk reviewed 2022-04-17 09:04:37 +00:00
Author
Member

It should be there.
"A simple article should have, at least, (a) a lead section and (b) references." (https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout#Order_of_article_elements)

If there are no headings/sections in the article, then the whole article will be returned by the "summary" call.
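
A minimal sketch of that behaviour, for illustration only. It assumes top-level sections in the rendered HTML are introduced by an `<h2>` element, which is a simplification of real Wikipedia markup; the helper name `lead_section` is hypothetical and is not part of this PR.

```python
import re

# Hypothetical helper, not part of the PR: keep only the HTML that
# precedes the first top-level section heading, i.e. the lead section.
# Assumption: sections start with an <h2> element.
_FIRST_HEADING = re.compile(r"<h2\b", re.IGNORECASE)


def lead_section(html: str) -> str:
    """Return the part of `html` before the first <h2>, or all of it."""
    match = _FIRST_HEADING.search(html)
    # No headings at all: the whole article counts as the "summary".
    return html if match is None else html[: match.start()]


with_sections = "<p>Lead text.</p><h2>History</h2><p>Details.</p>"
no_sections = "<p>A short stub article.</p>"

print(lead_section(with_sections))
print(lead_section(no_sections))
```

An article without headings passes through unchanged, which matches the fallback described above.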
pastk reviewed 2022-04-17 09:05:45 +00:00
Author
Member

To install it as a Python package.
Author
Member

> If we run the script on the current data as it is, how many gigabytes of articles will it download?

I have no means to check it, sorry.
It should be possible to estimate it by processing an archive of already downloaded articles.
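
One way such an estimate could be made is sketched below. It assumes the archive is a directory tree of `.html` files and that the summary ends at the first `<h2>`; both are assumptions, and the function name `estimate_sizes` is hypothetical. It compares raw HTML bytes only, so it says nothing about the compressed size inside a map file.

```python
import os
import re

# Assumption: the summary is everything before the first <h2> element.
_FIRST_HEADING = re.compile(rb"<h2\b", re.IGNORECASE)


def estimate_sizes(root: str) -> tuple:
    """Return (full_bytes, summary_bytes) over all .html files under root."""
    full = summary = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".html"):
                continue  # skip anything that is not a downloaded article
            with open(os.path.join(dirpath, name), "rb") as f:
                data = f.read()
            full += len(data)
            match = _FIRST_HEADING.search(data)
            # Articles without headings count in full, as in the summary call.
            summary += len(data) if match is None else match.start()
    return full, summary
```

Calling `estimate_sizes` on the directory with the previously downloaded articles returns the two byte totals, whose ratio gives a rough idea of the possible saving.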
pastk reviewed 2022-04-17 09:21:36 +00:00
Author
Member

It leads to significant map size inflation (and it's enabled for only 5 languages so far!), though the added value for most users is small.
I doubt many users will read full Wikipedia articles in a map app; it's not an offline Wikipedia reader, after all.

Many more users will read short descriptions of POIs, though (they are a kind of replacement for missing OSM descriptions), and given that they take much less space, the relative added value will be much higher.
biodranik (Migrated from github.com) reviewed 2022-04-17 10:51:29 +00:00
biodranik (Migrated from github.com) commented 2022-04-17 10:51:29 +00:00

All articles I saw on the map were already stripped and contained only the summary, or the summary plus one or two other interesting and useful sections. That's why I'm asking how much, and what exactly, we can save with this patch.

If you have some "bad" examples that are too large and can easily be stripped because their sections are "unnecessary", let's check them and maybe add these sections to the exceptions.
pastk reviewed 2022-04-17 11:43:29 +00:00
Author
Member

Check any capital or significant sight, e.g.
https://en.wikipedia.org/wiki/Eiffel_Tower
https://en.wikipedia.org/wiki/Tbilisi

Section names are non-standard, so it would be impossible to gain big savings by such filtering. E.g. I just checked some articles for a smaller Russian city and found many sections of very little (or very specific) interest:
https://ru.wikipedia.org/wiki/Марийский_государственный_университет
https://ru.wikipedia.org/wiki/Йошкар-Олинская_ТЭЦ-2
https://ru.wikipedia.org/wiki/Йошкар-олинский_троллейбус
biodranik (Migrated from github.com) reviewed 2022-04-17 11:57:18 +00:00
biodranik (Migrated from github.com) commented 2022-04-17 11:57:18 +00:00

Checked in the app. I see only a few relevant sections in all examples, except cities. We don't store wiki articles in the world map now, so it's impossible to check their sections.
Author
Member

Some examples of map size inflation:
Paris - from 18MB to 65MB
Moscow - from 51MB to 92MB
Buenos Aires - from 39MB to 70MB
Mexico City - from 148MB to 180MB
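
For reference, the absolute and relative increases implied by these figures can be worked out as below. This is only arithmetic on the numbers quoted above; reading each pair as "map without wiki articles, map with wiki articles" is an assumption about what the two sizes mean.

```python
# Sizes in MB as quoted above. Assumption: (before, after) means
# (map without wiki articles, map with wiki articles).
sizes_mb = {
    "Paris": (18, 65),
    "Moscow": (51, 92),
    "Buenos Aires": (39, 70),
    "Mexico City": (148, 180),
}


def inflation(before: int, after: int) -> tuple:
    """Return (absolute increase in MB, relative increase in whole percent)."""
    delta = after - before
    return delta, round(100 * delta / before)


for city, (before, after) in sizes_mb.items():
    delta, percent = inflation(before, after)
    print(f"{city}: +{delta} MB (+{percent}%)")
# Paris: +47 MB (+261%)
# Moscow: +41 MB (+80%)
# Buenos Aires: +31 MB (+79%)
# Mexico City: +32 MB (+22%)
```

The absolute increase is similar across the four maps (31 to 47 MB), so the relative impact is largest on the smallest map.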

The big point is:
Users download and keep map files for the sake of the map data, not for the auxiliary function of offline Wikipedia reading!
Especially when they can't understand many of the languages the wiki articles are stored in, it just becomes a waste of space and bandwidth. IMHO compact map files are one of the notable advantages of OM.

So any significant map size increase should be considered very carefully, weighing the value most users will get from it.

I think some questions were not answered before the Wikipedia feature was rolled out:

  • what will the total size increase be if we include all OM languages? (now it's limited to just 5)
  • what will the size increase be for tourist centers (which are downloaded most often) if we include all languages?
  • what will the size increase be if we include all languages but article summaries only?
  • and if we include English summaries only?

It might happen that sizes will be significantly inflated even if we limit to summaries only but include all OM languages.
I don't think it would be fair to keep the feature limited to only a few languages as it is now. In that case users who don't understand these languages will be at a big disadvantage (no value added for them, but they have to cope with bigger files anyway), and these are mostly users from third-world countries who don't possess modern devices with lots of storage and cheap traffic.
pastk reviewed 2022-04-17 12:22:25 +00:00
Author
Member

I dunno what's useful in having sections with lists of a university's buildings and faculties, or technical specs of the machinery used at a power station...
I don't think it's worth spending storage space and bandwidth on this special-interest stuff.
biodranik commented 2022-04-18 19:53:23 +00:00 (Migrated from github.com)

I suggest running this patch after the current generation to check what size the downloaded descriptions come to, and drawing the final conclusion from real numbers.
By the way, the current code for some reason downloads many other languages (though they add up to a small total size) and mixes mobile and non-mobile versions in the on-disk cache. That looks strange.
@vng
biodranik commented 2022-04-18 22:29:04 +00:00 (Migrated from github.com)

The points and questions are valid. The value of the wiki is questionable, except perhaps as an additional signal of feature popularity.
Competitors store the wiki article layer separately, and that is quite convenient for users.
biodranik commented 2022-04-18 22:29:46 +00:00 (Migrated from github.com)

Another upside is that it will be easy to cut the articles out in upcoming releases.
Author
Member

Moreover, it can serve as an additional signal of feature popularity even without offline articles. The presence of the wikipedia= tag is enough.
biodranik commented 2022-04-19 17:33:45 +00:00 (Migrated from github.com)

> Moreover, it can serve as an additional signal of feature popularity even without offline articles. The presence of the wikipedia= tag is enough.

Article length is also a perfectly good signal. Though section filtering gets in the way here :)
Author
Member

> > Moreover, it can serve as an additional signal of feature popularity even without offline articles. The presence of the wikipedia= tag is enough.

> > Article length is also a perfectly good signal. Though section filtering gets in the way here :)

Wikipedia recommends that the length of the summary match the length of the article: a longer summary is allowed for a large article :)
biodranik commented 2022-04-25 05:55:30 +00:00 (Migrated from github.com)

@vng can you please try this commit before we merge it, so we can manually check how many descriptions were downloaded and how good they are compared to the previous version?
Author
Member

I suggest we download full articles for all map languages first, and after that run some experiments/comparisons to determine the best approach. (Note: we need to save the intermediate wiki URL files to be able to tell which article belongs to which mwm. Ideally there should be a "main" feature tag too (or at least a "touristic" checker flag), but this is too much to ask, I guess.)
biodranik commented 2022-04-25 07:53:46 +00:00 (Migrated from github.com)

We can already compare it for existing languages without delays.
vng commented 2022-04-29 15:14:43 +00:00 (Migrated from github.com)

I couldn't get a result. I made a build with this PR and it reached the download stage; htop showed all threads downloading something, and it went on like that for 36 hours. In the end I see only ~100 articles on disk (the previous time it downloaded everything, thousands of articles on disk). The Python script has no logging at all, so it's impossible to tell what is going on.

These scripts need to be taken on separately and properly polished, starting with StageDownloadDescriptions. Or a separate service should be built on top of it to download and update the article database.

I can provide wiki_urls.txt and id_to_wikidata.csv from the last build.
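
As an illustration of the kind of logging that is missing, here is a small sketch using Python's standard `logging` module. It is not the actual downloader code: the function `download_all`, its `fetch` parameter, and the logger name are all hypothetical, and the real script's threading is left out.

```python
import logging

# Hypothetical logger name; the real script may use a different one.
log = logging.getLogger("descriptions_downloader")


def download_all(urls, fetch, report_every=100):
    """Call fetch(url) for each URL, logging failures and periodic progress.

    `fetch` is any callable that downloads one article and raises on failure.
    Returns a (succeeded, failed) pair of counts.
    """
    ok = failed = 0
    for i, url in enumerate(urls, start=1):
        try:
            fetch(url)
            ok += 1
        except Exception:
            failed += 1
            # Records the URL together with the full traceback.
            log.exception("Failed to download %s", url)
        if i % report_every == 0:
            log.info("Processed %d URLs: %d ok, %d failed", i, ok, failed)
    log.info("Done: %d ok, %d failed", ok, failed)
    return ok, failed
```

Calling `logging.basicConfig(level=logging.INFO, format="%(asctime)s %(threadName)s %(message)s")` once at startup would make the progress lines visible and show which thread produced each message.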
AntonM030481 commented 2022-06-05 19:37:22 +00:00 (Migrated from github.com)

What about separating map data and wiki (separately for each language), with the user downloading map data + wiki for the current OM language only?
We will need automatic fallback from a less popular language to a more popular one if there is no article in it, e.g. BY, UA -> RU, RU -> EN.
If the user changes the OM language, map updates become available to them.
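
The fallback idea could look roughly like the sketch below. The table only encodes the example chain from this comment (BY, UA -> RU, RU -> EN), with language codes written as they appear there; the names `FALLBACK` and `pick_article_language` are hypothetical and nothing like this exists in the PR.

```python
# Hypothetical fallback table built from the example in the comment above.
FALLBACK = {"by": "ru", "ua": "ru", "ru": "en"}


def pick_article_language(preferred, available):
    """Return the first language in the fallback chain that has an article.

    `available` is the set of languages in which the article exists.
    Returns None if no language in the chain is available.
    """
    seen = set()  # guards against accidental cycles in the table
    lang = preferred
    while lang is not None and lang not in seen:
        if lang in available:
            return lang
        seen.add(lang)
        lang = FALLBACK.get(lang)
    return None


print(pick_article_language("ua", {"en"}))        # en
print(pick_article_language("by", {"ru", "en"}))  # ru
print(pick_article_language("de", {"en"}))        # None
```

A language with no entry in the table simply has no fallback here; whether everything should eventually fall back to English is a product decision this sketch does not make.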
rtsisyk reviewed 2022-06-10 06:53:24 +00:00

I don't worry about MWM size, to be honest. Current wikipedia content in the app is good.
Author
Member

> What about separation of map data and wiki (separately for each language), and downloading by user map data + wiki for current OM language only?

It's a separate and much bigger task; it's better to discuss it in a separate issue.
Author
Member

@vng did the last maps update use the old wiki data, or did you manage to re-download all the articles?
vng commented 2022-06-22 16:28:25 +00:00 (Migrated from github.com)

The old one, didn't try after the last attempt a month ago.
euf commented 2022-08-02 07:58:59 +00:00 (Migrated from github.com)

Please don’t. I use Wikipedia article dumps a lot while traveling with OM. Decreasing map dumps size doesn’t justify deleting very valuable (and for some users main) functionality.
This repo is archived. You cannot comment on pull requests.
Reference: organicmaps/organicmaps-tmp#2410