2012-02-09 77 views

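Sphinx with UTF-8 string attributes

I have a Sphinx index of UTF-8 documents, specifically artist names. For various reasons, we have the name both as a field (indexed_name) and as an attribute (name). When I search for a document, I find the right ones, but the attribute comes back corrupted: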

mysql> select name from artist where match('@indexed_name Sánchez') limit 3; 
+---------+--------+-----------------------+ 
| id  | weight | name     | 
+---------+--------+-----------------------+ 
| 7843884 | 2642 | Sarita Sánchez  | 
| 8519538 | 2642 | Cristhian Sánchez | 
| 3853986 | 2627 | Alfonso Sánchez | 
+---------+--------+-----------------------+ 
3 rows in set (0.02 sec) 

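It looks like the attribute was originally UTF-8, but was treated as ISO-8859-1 and then converted to UTF-8 again. When I run the same query from Ruby, it looks like it has gone through that conversion twice: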

[1] pry(main)> rs = Thebes::Sphinxql::Query.run("select name from artist where match('@indexed_name Sánchez')") 
=> #<Mysql2::Result:0x000000029bebf8 (omitted...) 
[2] pry(main)> name = rs.first['name'] 
=> "Sarita SÃ\u0083Â¡nchez" 
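A minimal sketch of that hypothesis, assuming the damage comes from reinterpreting UTF-8 bytes as ISO-8859-1 and transcoding them to UTF-8 again. The helper `misdecode_as_latin1` is hypothetical and only simulates the suspected fault; it is not part of Sphinx, Thebes, or mysql2. Applying it twice to a clean name yields a string whose ISO-8859-1 bytes for the accented letter are C3 83 C2 A1.

```ruby
# Hypothetical helper: treat the UTF-8 bytes as ISO-8859-1, then transcode to UTF-8.
# This simulates the suspected fault; it is not code from any library involved.
def misdecode_as_latin1(text)
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

original = "S\u00E1nchez"                  # "Sánchez", written with an escape
once     = misdecode_as_latin1(original)   # one wrong pass
twice    = misdecode_as_latin1(once)       # two wrong passes

puts once.inspect
puts twice.inspect

# Bytes left after undoing one pass of the twice-damaged string.
puts twice.encode(Encoding::ISO_8859_1).bytes.map { |b| format("%02X", b) }.join(" ")
```

If the last line prints the same byte pattern that the first `encode!("ISO-8859-1")` call on the Sphinx result produces, the attribute has gone through two wrong passes rather than one.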

Is this a bug in Sphinx, or am I doing something wrong?

I can reverse it by cycling the string through ISO-8859-1 and UTF-8:

[4] pry(main)> name.encode!("ISO-8859-1") 
=> "Sarita S\xC3\x83\xC2\xA1nchez" 
[5] pry(main)> name.force_encoding("UTF-8") 
=> "Sarita SÃ¡nchez" 
[6] pry(main)> name.encode!("ISO-8859-1") 
=> "Sarita S\xC3\xA1nchez" 
[7] pry(main)> name.force_encoding("UTF-8") 
=> "Sarita Sánchez" 

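Is that going to work, though, for characters in the other ISO-8859-* character sets and for text that legitimately needs Unicode?

Update 1:

The answer to the second question is no. Searching for a Turkish name: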

mysql> select name from artist where match('@indexed_name ÖZDEMİR') limit 3; 

+---------+--------+-------------------------------+ 
| id  | weight | name       | 
+---------+--------+-------------------------------+ 
| 1753230 | 2664 | Nurullah Alper ÖZDEMİR | 
| 6973956 | 2664 | YİĞİT ÖZDEMİR | 
| 9133770 | 2664 | TAHA ÖZDEMİR   | 
+---------+--------+-------------------------------+ 
3 rows in set (0.01 sec) 

The second one is supposed to be "YİĞİT ÖZDEMİR."

[2] pry(main)> rs = Thebes::Sphinxql::Query.run("select name from artist where match('@indexed_name ÖZDEMİR') limit 3") 
=> #<Mysql2::Result:0x000000047779b0... 
[5] pry(main)> name = rs.to_a[1]['name'].dup 
=> "YÃ\u0084Â°Ã\u0084Å¾Ã\u0084Â°T Ã\u0083â\u0080\u0093ZDEMÃ\u0084Â°R" 
[6] pry(main)> name.encode!("ISO-8859-1") 
=> "Y\xC3\x84\xC2\xB0\xC3\x84\xC5\xBE\xC3\x84\xC2\xB0T \xC3\x83\xE2\x80\x93ZDEM\xC3\x84\xC2\xB0R" 
[7] pry(main)> name.force_encoding("UTF-8") 
=> "YÄ°ÄžÄ°T Ã–ZDEMÄ°R" 
[8] pry(main)> name.encode!("ISO-8859-1") 
Encoding::UndefinedConversionError: U+017E from UTF-8 to ISO-8859-1 
from (pry):8:in `encode!' 

I don't know how Ö turned into Ã–, which seems to be five bytes wide...
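One possible explanation for the five bytes, offered as an assumption and not something the index configuration confirms: the wrong decoding step used Windows-1252, not ISO-8859-1. In Windows-1252 the byte 0x96 is an en dash (U+2013), which takes three bytes in UTF-8, and 0x9E is ž (U+017E), which ISO-8859-1 cannot represent at all. That would also account for the `UndefinedConversionError` above. The helpers `misdecode_as_cp1252` and `hex_bytes` are hypothetical names used only for this sketch.

```ruby
# Assumption: the faulty step decoded UTF-8 bytes as Windows-1252, not ISO-8859-1.
# Hypothetical helper that simulates one such pass.
def misdecode_as_cp1252(text)
  text.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

# Hypothetical helper: render a string's bytes as space-separated hex.
def hex_bytes(text)
  text.bytes.map { |b| format("%02X", b) }.join(" ")
end

o_umlaut = "\u00D6"                        # Ö, UTF-8 bytes C3 96
mangled  = misdecode_as_cp1252(o_umlaut)   # Ã followed by an en dash

puts hex_bytes(o_umlaut)
puts hex_bytes(mangled)
puts mangled.bytesize

g_breve = "\u011E"                         # Ğ, UTF-8 bytes C4 9E
puts hex_bytes(misdecode_as_cp1252(g_breve))
```

If the hex output matches the bytes in the pry session (C3 83 E2 80 93 for Ö, C3 84 C5 BE for Ğ), then reversing through "Windows-1252" instead of "ISO-8859-1" is the round trip worth trying for these names.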

Update 2:

I'd rather not post my entire sphinx.conf, but here is the configuration for the index used here. It was generated by Thinking Sphinx.

source artist_core_0 
{ 
    type = mysql 
    sql_host = (omitted) 
    sql_user = (omitted) 
    sql_pass = (omitted) 
    sql_db = (omitted) 
    sql_query_pre = SET NAMES utf8 
    sql_query_pre = SET TIME_ZONE = '+0:00' 
    sql_query = (omitted) 
    sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) FROM `artists` 
    sql_attr_uint = sphinx_internal_id 
    sql_attr_uint = sphinx_deleted 
    sql_attr_uint = class_crc 
    sql_attr_float = latitude 
    sql_attr_float = longitude 
    sql_attr_string = sphinx_internal_class 
    sql_attr_string = name 
    sql_attr_string = homepage 
    sql_attr_string = image 
    sql_attr_string = city 
    sql_attr_string = state 
    sql_attr_string = postal_code 
    sql_attr_string = country 
    sql_query_info = SELECT * FROM `artists` WHERE `id` = (($id - 0)/6) 
} 

index artist_core 
{ 
    source = artist_core_0 
    path = (omitted) 
    morphology = libstemmer_en, libstemmer_fr, libstemmer_tr, libstemmer_es, libstemmer_de, libstemmer_it 
    charset_type = utf-8 
    min_prefix_len = 3 
    enable_star = 1 
} 

index artist 
{ 
    type = distributed 
    local = artist_core 
} 

Can you attach your sphinx.conf? – 2012-02-10 04:42:40

Answer


Never mind. The data in our database was double-encoded.
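For anyone who hits the same thing and cannot fix the stored data right away, here is a rough sketch of a guarded repair on the Ruby side. The helper `undo_double_encoding` is hypothetical, and it assumes the extra layer was added by decoding UTF-8 bytes as Windows-1252. It undoes one layer and falls back to the input when the result is not valid UTF-8 or the text cannot be encoded in the intermediate charset.

```ruby
# Hypothetical helper, not part of Sphinx, Thinking Sphinx, or mysql2.
# Assumption: one extra layer was added by decoding UTF-8 bytes as Windows-1252.
def undo_double_encoding(text, via: Encoding::Windows_1252)
  # Map each character back to the single byte it was misread from,
  # then reinterpret the resulting bytes as UTF-8.
  candidate = text.encode(via).force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  # The text contains characters outside the intermediate charset,
  # so it was probably not double-encoded this way; leave it alone.
  text
end

# "YİĞİT ÖZDEMİR" after one wrong pass, written with escapes.
double_encoded = "Y\u00C4\u00B0\u00C4\u017E\u00C4\u00B0T \u00C3\u2013ZDEM\u00C4\u00B0R"

puts undo_double_encoding(double_encoded)
puts undo_double_encoding("Sarita S\u00E1nchez")   # already clean, returned unchanged
```

This is a heuristic: text that legitimately contains a sequence such as "Ã¡" would be rewritten as well, so correcting the data at the source remains the real fix.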