Please read the LICENSE file, which ships with this software.
*** QUICK START ***

To compile the C library, call "make c-library"; to compile the ruby
library, call "make ruby-library"; and to compile the PostgreSQL extension,
call "make pgsql-library".

For ruby you can also create a gem-file by calling "make ruby-gem".

"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.
*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation and
is provided as "libutf8proc.a" and "libutf8proc.so". The ruby library
consists of the files "utf8proc.rb" and "utf8proc_native.so", which are
found in the subdirectory "ruby/". If you chose to create a gem-file, it is
placed in the "ruby/gem" directory. The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not available at the time of implementation.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT
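
The effect of the four option sets can be previewed without utf8proc. The
following is only a sketch under stated assumptions: it uses Ruby's built-in
String#unicode_normalize as a stand-in, so it follows the Unicode version of
the Ruby interpreter rather than the one supported here. The helper names
"normalization_forms" and "normalize_all" are hypothetical.

```ruby
# Stand-in sketch: Ruby's stdlib normalization, NOT the utf8proc library.
# Keys are the option sets listed above (written as lowercase symbols),
# values are the matching stdlib form names.
def normalization_forms
  {
    %i[stable compose]          => :nfc,
    %i[stable decompose]        => :nfd,
    %i[stable compose compat]   => :nfkc,
    %i[stable decompose compat] => :nfkd
  }
end

# Returns a hash from option set to the code points of the normalized text.
def normalize_all(text)
  normalization_forms.to_h do |options, form|
    [options, text.unicode_normalize(form).codepoints]
  end
end

# Precomposed "e with acute" (U+00E9) followed by the "fi" ligature (U+FB01).
sample = "\u00E9\uFB01"

normalize_all(sample).each do |options, codepoints|
  hex = codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{options.join(', ')}: #{hex}"
end
```

The COMPAT variants additionally replace the ligature by the plain letters
"f" and "i", while DECOMPOSE splits the accented letter into a base letter
and a combining mark.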
*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.
*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!
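
The difference between a plain method and its "!" counterpart can be
illustrated as follows. This is a sketch only: it assumes that Ruby's
built-in unicode_normalize / unicode_normalize! pair behaves analogously to
utf8nfd / utf8nfd! and does not call utf8proc itself. The helper names are
hypothetical.

```ruby
# Stand-in sketch using the Ruby stdlib, NOT the utf8proc methods.

# Non-destructive: returns a new, decomposed string; the receiver is untouched.
def decomposed_copy(text)
  text.unicode_normalize(:nfd)
end

# Destructive: the contents of the receiver are replaced by the result.
def decompose_in_place(text)
  text.unicode_normalize!(:nfd)
  text
end

original = +"\u00E9"            # unary plus yields an unfrozen string
copy = decomposed_copy(original)
puts original.length            # still one code point
puts copy.length                # base letter plus combining accent

decompose_in_place(original)
puts original.length            # the original string has now changed as well
```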
The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point:
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
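
The octal escapes in the second example are the three UTF-8 bytes of U+2028.
A minimal sketch, assuming that Ruby's built-in Integer#chr with an explicit
UTF-8 encoding yields the same string as Integer#utf8 (the helper names are
hypothetical and utf8proc is not used):

```ruby
# Stand-in sketch using the Ruby stdlib, NOT Integer#utf8 from utf8proc.

# Encodes a single code point as a UTF-8 string.
def codepoint_to_utf8(code_point)
  code_point.chr(Encoding::UTF_8)
end

# Renders the bytes of a string as octal escapes, as in the examples above.
def octal_escapes(text)
  text.bytes.map { |byte| format("\\%o", byte) }.join
end

puts codepoint_to_utf8(0x000A) == "\n"
puts octal_escapes(codepoint_to_utf8(0x2028))
```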
*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks such as accents or
diaereses, while "unifold" keeps them.
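
The kind of folding described above can be approximated in plain Ruby. This
is a rough sketch only: the exact option sets behind "unifold" and
"unistrip" are not listed in this file, so the hypothetical helpers
"approx_unifold" and "approx_unistrip" merely reproduce the examples given
here (soft hyphen removal, case folding, mark stripping) with the Ruby
stdlib. Their output is not guaranteed to match the SQL functions.

```ruby
# Approximation with the Ruby stdlib, NOT the actual SQL functions.

# Drops soft hyphens (U+00AD), applies compatibility composition and
# folds case. Accents are kept.
def approx_unifold(text)
  text.delete("\u00AD").unicode_normalize(:nfkc).downcase(:fold)
end

# Like approx_unifold, but first decomposes the text and removes
# non-spacing marks (accents, diaereses and the like).
def approx_unistrip(text)
  decomposed = text.delete("\u00AD").unicode_normalize(:nfkd)
  approx_unifold(decomposed.gsub(/\p{Mn}/, ""))
end

puts approx_unifold("bath\u00ADtub") == approx_unifold("bathtub")
puts approx_unifold("Hello World") == approx_unifold("hello world")
puts approx_unifold("Caf\u00E9")
puts approx_unistrip("Caf\u00E9")
```

Comparing both sides through the same function, as the SELECT statement
above does, is what makes the comparison independent of the exact folding
rules.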
NOTICE: The outputs of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.
*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing
*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc