Investigate escaping in article titles and urls #7

Open
opened 2023-06-24 19:20:44 +00:00 by newsch · 0 comments
newsch commented 2023-06-24 19:20:44 +00:00 (Migrated from github.com)

Wikipedia article titles can contain slashes (/). Wikipedia accepts them in URLs escaped or not, e.g.
https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport
and
https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport
return the same page, and neither redirects to the other.
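
A minimal sketch of why the two forms are equivalent once percent-decoded. It only compares the decoded titles locally and does not contact Wikipedia; `article_title` is a hypothetical helper, not part of the generator.

```python
from urllib.parse import unquote, urlsplit


def article_title(url: str) -> str:
    """Return the percent-decoded article title from a /wiki/ URL."""
    path = urlsplit(url).path  # the path is returned still percent-encoded
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ URL: {url}")
    return unquote(path[len(prefix):])


escaped = "https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport"
plain = "https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport"

# Both spellings decode to the same title, slash included.
print(article_title(escaped) == article_title(plain))  # True
```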

The generator attempts to percent-decode URLs from OSM tags, and then encodes any '%' again when it converts them back into URLs.
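
A rough illustration of how the re-encoding step could produce a doubled escape, assuming it simply replaces '%' with '%25' and that the input was never decoded. `encode_percent_only` is a hypothetical stand-in, not the generator's actual code.

```python
def encode_percent_only(title: str) -> str:
    """Escape only the percent sign, leaving everything else untouched."""
    return title.replace("%", "%25")


# A literal percent sign is escaped correctly.
literal = encode_percent_only("100%_Banco")
# A value that is still percent-encoded gets its escapes escaped again.
doubled = encode_percent_only("Kanngjutarm%C3%A4starens_hus")

print(literal)  # 100%25_Banco
print(doubled)  # Kanngjutarm%25C3%25A4starens_hus
```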

My guess is that some of the tags that are not URLs still contain URL encoding. Determining which values are actually URL-encoded and which just contain a literal % is a little tricky, and the generator doesn't attempt it.
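
One possible heuristic for telling the two cases apart, sketched under the assumption that a value is percent-encoded only if every '%' begins a valid %XX escape and the decoded bytes are valid UTF-8. `looks_percent_encoded` is a hypothetical name, and the check cannot resolve truly ambiguous values: a title that literally contains "%25" would still be reported as encoded.

```python
import re
from urllib.parse import unquote_to_bytes

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def looks_percent_encoded(text: str) -> bool:
    """Guess whether text is percent-encoded rather than containing a literal '%'."""
    if "%" not in text:
        return False
    # A '%' that is not followed by two hex digits is taken as a literal percent sign.
    if text.count("%") != len(_ESCAPE.findall(text)):
        return False
    try:
        # Escapes that do not decode to valid UTF-8 are unlikely to be an encoded title.
        unquote_to_bytes(text).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_percent_encoded("Kanngjutarm%C3%A4starens_hus"))  # True
print(looks_percent_encoded("100%_Banco"))  # False
```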

It looks like some of the resulting URLs are encoded twice, though thankfully only a small number:

$ tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%' | sort | uniq
https://de.wikipedia.org/wiki/Georg-B%25C3%25BCchner-Platz
https://de.wikipedia.org/wiki/Kontorhaus_am_J%25C3%25B6debrunnen
https://en.wikipedia.org/wiki/Brighton_%2526_Hove_Greyhound_Stadium
https://en.wikipedia.org/wiki/de:Liste_der_Kulturdenkmäler_in_Schwachhausen#0218%252CT003
https://en.wikipedia.org/wiki/McMullen%2527s_Brewery
https://en.wikipedia.org/wiki/P%25C3%25A9cs_TV_Tower
https://en.wikipedia.org/wiki/Sedbergh_People%2527s_Hall
https://en.wikipedia.org/wiki/Sight_%2526_Sound_Theatres
https://es.wikipedia.org/wiki/100%25_Banco
https://es.wikipedia.org/wiki/Ruta_de_los_D%25C3%25B3lmenes
https://FR.wikipedia.org/wiki/Maisons_industrialis%25C3%25A9es_%25C3%25A0_Meudon
https://fr.wikipedia.org/wiki/Salm_(rivi%25C3%25A8re_de_Belgique)
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
https://sv.wikipedia.org/wiki/Kungliga_Tr%25C3%25A4dg%25C3%25A5rden_3
https://sv.wikipedia.org/wiki/Sverigev%25C3%25A4ggen
https://sv.wikipedia.org/wiki/V%25C3%25A4ttern,_Storfors_kommun
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Of those, all except the three below are malformed:

https://es.wikipedia.org/wiki/100%25_Banco
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598

Some seem to be percent-encoded arbitrary character data (https://en.wikipedia.org/wiki/URL_encoding#Character_data), for example:

https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
with the extra escaped %25s removed becomes:
https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
which the browser converts to:
https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
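
The same two steps can be reproduced with Python's standard library, as a sketch: one pass removes the extra layer of escaping, and a second pass yields the readable form. The variable names are illustrative only.

```python
from urllib.parse import unquote

double_encoded = "https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus"

# First pass: each %25 becomes '%', leaving an ordinary percent-encoded URL.
single_encoded = unquote(double_encoded)
# Second pass: %C3%A4 is the UTF-8 encoding of 'ä'.
readable = unquote(single_encoded)

print(single_encoded)  # https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
print(readable)        # https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
```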

Reference: organicmaps/wikiparser#7