tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: org.apache.tika.parser.txt.UniversalEncodingListener
Date Fri, 02 Nov 2012 19:24:04 GMT

On Nov 2, 2012, at 6:38am, Aleksandr Dubinsky wrote:

> I am having a problem with text files saved in Windows-1252 (or similar)
> encoding with LF linebreaks. Characters in the range 80 to 9F are returning
> as control codes.
> 
> Question: why is this class second-guessing Mozilla's 1252 determination
> and returning ISO 8859-1 (line 62)? What purpose does that serve?

When you say "Mozilla's 1252 determination", where is that coming from and how is that being
communicated to Tika?

Are you passing it in via the CONTENT_TYPE field in the Metadata?

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
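
Editorial illustration of the symptom described in the quoted message: bytes in the range 0x80 to 0x9F are printable punctuation in windows-1252, but ISO-8859-1 maps the same bytes to C1 control characters. This is a minimal sketch that uses only the JDK's charset support and does not call Tika at all. The class name EncodingMismatchDemo and its helper methods are hypothetical, and it assumes the running JDK provides the windows-1252 charset, which mainstream JDKs do although the Java specification does not guarantee it.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of Tika.
public class EncodingMismatchDemo {

    // Decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // True if the string contains any C1 control character (U+0080 to U+009F).
    static boolean hasC1Control(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252; the line ends with LF.
        byte[] bytes = { (byte) 0x93, 'h', 'i', (byte) 0x94, '\n' };

        String as1252 = decode(bytes, "windows-1252");
        String asLatin1 = decode(bytes, "ISO-8859-1");

        System.out.println("windows-1252 has C1 controls: " + hasC1Control(as1252));
        System.out.println("ISO-8859-1 has C1 controls:   " + hasC1Control(asLatin1));
    }
}
```

If the reported encoding is ISO-8859-1 for a file that was actually saved as windows-1252, the second decoding is the one a caller ends up with, which matches the control codes described above.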




