lucene-solr-user mailing list archives

From "HUYLEBROECK Jeremy RD-ILAB-SSF" <jeremy.huylebro...@orange-ftgroup.com>
Subject RE: Unicode characters
Date Tue, 01 May 2007 20:22:19 GMT

Thanks a lot for the time you spent understanding my problem and
checking for a solution in Neko!
It helps a lot.


-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Friday, April 27, 2007 4:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Unicode characters 


: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using Neko
: library
: -get a unicode String in Java
: -Send it to SOLR through XML created by SAX, with the right encoding
: (UTF-8) specified everywhere (writer, header etc...)
: -it apparently arrives clean on the SOLR side (verified in our logs).
: -In the query output from SOLR (XML message), the character is not
: encoded as an entity (not &#149;) but the character itself is used
: (character 149=95 hexadecimal).

Just because someone uses an html entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... I think that in
theory we could use numeric entities to escape *every* character but
that would make the XML responses a lot bigger ... so in general Solr
only escapes the characters that need to be escaped to have a valid
UTF-8 XML response.
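
To illustrate the point about escaping only what has to be escaped, here is a
minimal sketch in Java. It is not Solr's actual response writer; the class
name MinimalXmlEscape and its escape method are made up for this example. It
escapes only the markup-significant characters and passes everything else
through as-is, relying on the UTF-8 encoding of the response to carry it.

```java
public final class MinimalXmlEscape {
    private MinimalXmlEscape() {}

    /**
     * Escapes only the characters that cannot appear literally in XML
     * text content. Every other character is copied through unchanged.
     * (Illustrative helper, not Solr's implementation.)
     */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The bullet (U+2022) is emitted as the character itself,
        // not as a numeric entity.
        System.out.println(escape("a < b & \u2022 bullet"));
    }
}
```

A non-ASCII character therefore shows up in the output as the character
itself rather than as a numeric entity, which matches the behavior described
in the original question.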

You may also be having some additional problems since 149 (hex 95) is
not a printable Unicode character, it's a C1 control character
(MESSAGE WAITING) ... it sounds like you're dealing with HTML where
people were using the numeric value from the "Windows-1252" charset.

You may want to modify your parsing code to map the "control"
characters that you know aren't meant to be control characters to the
characters they were intended to be, before you ever send them to Solr.
A quick search for "Neko windows-1252" indicates that enough people have
had problems with this that it is a built-in feature...
    http://people.apache.org/~andyc/neko/doc/html/settings.html
    "http://cyberneko.org/html/features/scanner/fix-mswindows-refs
     Specifies whether to fix character entity references for Microsoft
     Windows characters as described at
     http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."
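
If you would rather do the mapping yourself after parsing, something along
these lines could work. This is only a sketch using the JDK's own
"windows-1252" charset; the class name Cp1252Fix and the method fix are
hypothetical, and it assumes that any code point in the range U+0080 to
U+009F in your input was really meant as a Windows-1252 byte.

```java
import java.nio.charset.Charset;

public final class Cp1252Fix {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    private Cp1252Fix() {}

    /**
     * Replaces C1 control characters (U+0080..U+009F) with the character
     * that the same byte value represents in Windows-1252.
     * Assumption: the input never contains intentional C1 controls.
     */
    public static String fix(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Reinterpret the code point as a single Windows-1252 byte.
                char mapped = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // A few byte values are undefined in Windows-1252; if the
                // decoder reports U+FFFD, keep the original character.
                out.append(mapped == '\uFFFD' ? c : mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0095 becomes U+2022 (BULLET), printed here as a hex code point.
        System.out.println(Integer.toHexString(fix("\u0095").charAt(0)));
    }
}
```

Run this on the decoded String before building the XML you send to Solr, so
that the index only ever sees the intended characters.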

(I've run into this a number of times over the years when dealing with
content created by windows users, as you can see from my one and only
thread on "JavaJunkies" ...
  http://www.javajunkies.org/index.pl?node_id=3436
)


-Hoss

