Issue 2787 - Facebook validation fix #2793
Labels
No labels
Accessibility
Accessibility
Address
Address
Android
Android
Android Auto
Android Auto
Android Automotive (AAOS)
Android Automotive (AAOS)
API
API
AppGallery
AppGallery
AppStore
AppStore
Battery and Performance
Battery and Performance
Blocker
Blocker
Bookmarks and Tracks
Bookmarks and Tracks
Borders
Borders
Bug
Bug
Build
Build
CarPlay
CarPlay
Classificator
Classificator
Community
Community
Core
Core
CrashReports
CrashReports
Cycling
Cycling
Desktop
Desktop
DevEx
DevEx
DevOps
DevOps
dev_sandbox
dev_sandbox
Directions
Directions
Documentation
Documentation
Downloader
Downloader
Drape
Drape
Driving
Driving
Duplicate
Duplicate
Editor
Editor
Elevation
Elevation
Enhancement
Enhancement
Epic
Epic
External Map Datasets
External Map Datasets
F-Droid
F-Droid
Fonts
Fonts
Frequently User Reported
Frequently User Reported
Fund
Fund
Generator
Generator
Good first issue
Good first issue
Google Play
Google Play
GPS
GPS
GSoC
GSoC
iCloud
iCloud
Icons
Icons
iOS
iOS
Legal
Legal
Linux Desktop
Linux Desktop
Linux packaging
Linux packaging
Linux Phone
Linux Phone
Mac OS
Mac OS
Map Data
Map Data
Metro
Metro
Navigation
Navigation
Need Feedback
Need Feedback
Night Mode
Night Mode
NLnet 2024-06-281
NLnet 2024-06-281
No Feature Parity
No Feature Parity
Opening Hours
Opening Hours
Outdoors
Outdoors
POI Info
POI Info
Privacy
Privacy
Public Transport
Public Transport
Raw Idea
Raw Idea
Refactoring
Refactoring
Regional
Regional
Regression
Regression
Releases
Releases
RoboTest
RoboTest
Route Planning
Route Planning
Routing
Routing
Ruler
Ruler
Search
Search
Security
Security
Styles
Styles
Tests
Tests
Track Recording
Track Recording
Translations
Translations
TTS
TTS
UI
UI
UX
UX
Walk Navigation
Walk Navigation
Watches
Watches
Web
Web
Wikipedia
Wikipedia
Windows
Windows
Won't fix
Won't fix
World Map
World Map
No milestone
No project
No assignees
2 participants
Due date
No due date set.
Dependencies
No dependencies set.
Reference: organicmaps/organicmaps-tmp#2793
Loading…
Add table
Reference in a new issue
No description provided.
Delete branch "issue-2787-facebook-validation-fix"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Working on #2787 issue.
Looks like FB validation fails because of extra character
á
(Latin Small Letter A with acute). It's part of Latin-1 Supplement part of codepage.This extra char code is
\xE1
. But when we get string from Java to C++ the char is converted to unicode bytes\xC3\xA1
. How to write correct regex to cover 2 bytes chars?And another question: should we also validate other Latin-1 Supplement characters such as
Ç
,Ó
,ß
,ñ
?Update 2022-06-27
Facebook allows very broad range of symbols in it's page urls. So validation of Facebook page input is now simplified: all characters are allowed except
!@^*()~@[]{}#$%&;,:+"'/\
. This will fix issue #2787 and allow all sort of non-latin alphabets in OrganicMaps.Note: some symbols are allowed by OrganicMaps but not in FB. For example:
£¤¥©®
. Eliminating these exceptions is hard unicode related task and is out if the scope of this PR.Ok, I digged a little into Facebook pages and got some not funny news.
Go to https://www.facebook.com/pages/create and put any strange symbols in "Name" field. You can create a page with very wide range of symbols in it's name. And those symbols could appear in URL.
!@^*()~@©[]{}¢¦¨®¯
False
#$%&;,.:'+/\
False
¶«»"'/¡£¤¥§¬
False
ª À Á Â Ã Ä Å Æ Ç
True
È É Ê Ë Ì Í Î Ï
True
º
True
¼½¾
False
オーガニックマップ
True
מפות אורגניות
(R-to-L lang)True
It means if you create a page with name
MÊGÅ «¿¶» CÄFË º¼½¾
then FB will register URL:MÊGÅ--CÄFË-º¼½¾-3141592653589793
(you can test it yourself 😉)So we need to allow only some symbols and all alphanum symbols. This is not 100% solution (symbols
¼½¾
would be marked as invalid) but we can live with it.In Python it would look like:
Question: is there any
isalnum()
functions in C++ for Unicode characters testing?Well, with this logic:
std::regex should more or less work with utf-8. Let's find a cleaner solution. So far please create some unit tests, btw they could be built for android and run on a device.
@vng your solution with
NormalizeInplace
function will work. After normalization we can use regex to find Latin characters and numbers.But now to check non-Latin letters: Cyrillic or Japanese symbols? Regex doesn't support it.
We can continue with
NormalizeInplace
and accept the fact that OM will mark some Facebook pages as invalid. E.g. https://osm.org/node/5296186621 hascontact:facebook=https://ru-ru.facebook.com/Институт-повышения-квалификации-и-переподготовки-экономических-кадров-БГЭУ-1418862625037617/)
No. It's better to turn normalize off completely for these pages if there are no other clean solutions. Or add proper unicode ranges into the std::regex.
I agree with Alex to take FB links as is, without normalization.
If we need to check if some unicode char is letter or symbol we need to have allowed ranges.
I wrote small python script to understand which Unicode chars are letters from Python 3.9 perspective: unicode-ranges-by-python.txt
We can write C++ function to check if
UniChar
hits any of those 250 ranges.Looks like Facebook uses same unicode table because I was able to create page with name
"Boat ᤀ ᥐ ᦀ ᨖ ᨠ ᐁ ᎀ ዀ ჺ ၕ ༀ ഒ"
and URL:https://www.facebook.com/Boat-ᤀ-ᥐ-ᦀ-ᨖ-ᨠ-ᐁ-ᎀ-ዀ-ჺ-ၕ-ༀ-ഒ-102615019173012
😮How do you think? Should I hardcode ranges and verify
UniChar
code to detect alphanum? (Did you even thought about using icu lib?)Why should we check and modify FB links?
If the link is correct, it should open without corrections.
If the link is incorrect, it won't open in any case, no?
@vng in MWM file and in editor we store only username or page name. If user enters
https://www.facebook.com/hlekrestaurant
then functionstring ValidateAndFormat_facebook(string const & facebookPage)
will cut only page name"hlekrestaurant"
and show it on PlacePage and in editor.OM editor should support not only URLs but short usernames too. And we need to support non-latin characters validation in FB usernames.
So, if we will keep username/page name without any validation?
I think we need some validation of user input, otherwise invalid data would be posted to OSM. But instead of unicode magic I can simplify rules: allow all chars except some symbols
"!@^*()~©[]{}®#$%&;,:'+/\
. If such username check fails then validate string as an URL.How do you think?
Looks like a good idea!
Agree, lets filter invalid url ascii symbols only.
@ -7,0 +13,4 @@
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("http://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://en-us.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
Why it is not possible to encode UTF-8 strings directly, or with
u8"строка"
prefix?Did you try
or the same with u8 prefix?
@ -25,2 +125,3 @@
TEST(!osm::ValidateFacebookPage("osm"), ());
TEST(!osm::ValidateFacebookPage("invalid_username"), ());
TEST(!osm::ValidateFacebookPage("@spaces are not welcome here"), ());
TEST(!osm::ValidateFacebookPage("spaces are not welcome here"), ());
Is / allowed?
Это точно все базовые запрещённые символы? Может надёжнее проверка типа
c < '0'
, чтобы отсеять контрольные символы ASCII таблицы?Велосипед уже давно изобретён :)
if (std::string::npos == str.find_first_of(kForbiddenFBSymbols)) { не найдено }
Весь этот блок можно написать проще:
Собачка дублировалась.
@ -7,0 +13,4 @@
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("http://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://en-us.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
I tried. The problem is that
find_first_of(...)
method compares two strings byte by byte. And that's why:Because in UTF encoding two strings
"¥"
and"ქ"
have common byte\xa5
. We need to compare unicode symbols not UTF8 bytes to search for symbols'£€¥'
.Я подумал, что если
facebookPage.front() == '@'
, то нет смысла проверять на URL с помощьюValidateWebsite(facebookPage)
.В предложенном тобою коде всё таки возможна двойная проверка для строк начинающихся с
@
.@ -7,0 +13,4 @@
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("http://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://en-us.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
Rewritten tests with
u8"тестовая строка"
syntax.@ -25,2 +125,3 @@
TEST(!osm::ValidateFacebookPage("osm"), ());
TEST(!osm::ValidateFacebookPage("invalid_username"), ());
TEST(!osm::ValidateFacebookPage("@spaces are not welcome here"), ());
TEST(!osm::ValidateFacebookPage("spaces are not welcome here"), ());
Added test with symbols
/@#
. There are more forbiddeb symbols[]{}#$%&;,:
. Need separate test string for each 😔Удалил свой велосипед ))
@ -7,0 +13,4 @@
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("http://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://www.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
TEST_EQUAL(osm::ValidateAndFormat_facebook("https://en-us.facebook.com/OpenStreetMap"), "OpenStreetMap", ());
Very good catch!
Then you can try to use either utf8cpp iterators by
#include "3party/utfcpp/source/utf8/unchecked.h"
, or use strings::MakeUniString and compare u32 characters. Both methods should usestd::find_first_of
@ -18,2 +17,4 @@
static auto const s_lineRegex = regex(R"(^[a-z0-9-_.]{4,20}$)");
// TODO: Current implementation looks only for restricted symbols from ASCII block ignoring
Not making additional substr copy: facebookPage.find_first_of(kForbiddenFBSymbols, 1)
Not making additional substr copy: page.find_first_of(kForbiddenFBSymbols, 1)
@ -18,2 +17,4 @@
static auto const s_lineRegex = regex(R"(^[a-z0-9-_.]{4,20}$)");
// TODO: Current implementation looks only for restricted symbols from ASCII block ignoring
Removed
page.substr(1)
Removed
page.substr(1)
@vng,
ValidateFacebookPage
function is used by Android App. SeeEditor.nativeIsFacebookPageValid(String)
. When user enters Facebook contact in OrganicMaps editor, this method validates input.ValidateAndFormat_facebook
is used in generator and Android App, but is not covered with tests. I'll write unit-tests for it.@ -25,2 +125,3 @@
TEST(!osm::ValidateFacebookPage("osm"), ());
TEST(!osm::ValidateFacebookPage("invalid_username"), ());
TEST(!osm::ValidateFacebookPage("@spaces are not welcome here"), ());
TEST(!osm::ValidateFacebookPage("spaces are not welcome here"), ());
I added unit test where in a loop verify validation reject for each forbidden symbol
Поменял проверку: вместо списка запрещённых символов проверяю
(ch >= ' ' && ch <= ',') || (ch >= '[' && ch <= '^') || ...
Официальной документации Facebook не предоставил, поэтому Я запретил все символы в ASCII диапазоне, кроме
'-'
,'.'
и'_'
(проверил вручную -- символы действительно запрещены в именах и URL-ах).Как быть с Unicode символами:
£€¥
? Во-первых, у меня нет исчерпывающего списка запрещённых символов, а во-вторых, нужно применятьMakeUniString
или utf8cpp.Прелагаю так и оставить: даже если пользователь введёт "плохой" символ, который попадёт в OSM, большой проблемы это не создаст.
Оставь TODO в юнит тестах и в методах валидации на этот случай.
Добавил TODO комментарий
Added tests for
ValidateAndFormat_*
functions@ -230,8 +259,13 @@ bool ValidateFacebookPage(string const & page)
if (page.empty())
return true;
Thanks. Fixed