tika-dev mailing list archives

From Aleksandr Dubinsky <adubin...@almson.net>
Subject org.apache.tika.parser.txt.UniversalEncodingListener
Date Fri, 02 Nov 2012 13:38:34 GMT
I am having a problem with text files saved in Windows-1252 (or a similar
encoding) with LF line breaks. Bytes in the range 0x80 to 0x9F come back
as control codes.
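
To illustrate the symptom, here is a minimal sketch in plain Java that does not call Tika at all. The class name C1Demo and the sample bytes are made up for the example, and it assumes the JDK in use ships the windows-1252 charset (mainstream JDKs do, but only ISO-8859-1 is guaranteed). It decodes the same bytes with both charsets and checks whether the first character is a control code.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Tika.
final class C1Demo {

    // Decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] bytes = { (byte) 0x93, 'h', 'i', (byte) 0x94 };

        String latin1 = decode(bytes, "ISO-8859-1");
        String cp1252 = decode(bytes, "windows-1252");

        // ISO-8859-1 maps byte 0x93 straight to the C1 control U+0093.
        System.out.printf("ISO-8859-1:   U+%04X control=%b%n",
                (int) latin1.charAt(0), Character.isISOControl(latin1.charAt(0)));

        // windows-1252 maps byte 0x93 to U+201C, a printable quotation mark.
        System.out.printf("windows-1252: U+%04X control=%b%n",
                (int) cp1252.charAt(0), Character.isISOControl(cp1252.charAt(0)));
    }
}
```

If the detected charset is reported as ISO-8859-1, any byte in 0x80 to 0x9F decodes to a C1 control character like the first case above.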

Question: why does this class second-guess Mozilla's windows-1252
determination and return ISO-8859-1 instead (line 62)? What purpose does
that serve?

Aleksandr Dubinsky
Almson Corp / x0x Source
98-10 64th Ave. Ste 3D
Rego Park, NY 11374
+1 (303) 800-4484
