tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1003) UnsupportedEncodingException: ansi
Date Mon, 08 Oct 2012 13:42:03 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471539#comment-13471539
] 

Ken Krugler commented on TIKA-1003:
-----------------------------------

As far as I know, "ansi" isn't a valid charset name. On Windows, "ANSI" refers to the
system default (locale-specific) code page, so it isn't appropriate as the character
encoding name for a document.

Still, it sounds like Tika should at least handle "ansi" as an alias for (probably)
windows-1252, since that is the most likely equivalent mapping.

However, the stack trace in the report shows the exception is coming from POI, so a change
in Tika alone won't help. You'd have to file an upstream bug report against the POI project.
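
For illustration only, here is a minimal sketch of the alias idea mentioned in this comment. It is not Tika's or POI's actual code: the class name `CharsetAliases`, the method `resolve`, and the alias table are hypothetical, and mapping "ansi" to windows-1252 is only the guess discussed above. It also assumes the JVM ships the windows-1252 charset, which common JDKs do but the Java specification does not guarantee.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika or POI.
class CharsetAliases {
    // Assumed alias table: "ansi" is treated as windows-1252 (a guess, see above).
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put("ansi", "windows-1252");
    }

    /**
     * Resolves a declared charset name, consulting the alias table first.
     * Returns null when the name is missing, illegal, or unsupported,
     * so the caller can fall back to detection or a default.
     */
    static Charset resolve(String name) {
        if (name == null) {
            return null;
        }
        String trimmed = name.trim();
        String alias = ALIASES.get(trimmed.toLowerCase(Locale.ROOT));
        String target = (alias != null) ? alias : trimmed;
        try {
            return Charset.forName(target);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }
}
```

Returning null instead of throwing leaves the fallback decision to the caller. As noted, a helper like this on the Tika side would not affect the failure inside POI's StringChunk.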
                
> UnsupportedEncodingException: ansi
> ----------------------------------
>
>                 Key: TIKA-1003
>                 URL: https://issues.apache.org/jira/browse/TIKA-1003
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Windows 2008 R2; Java 1.6
>            Reporter: Alexander Veit
>
> Tika cannot decode the "ansi" encoding.
> The mail contains the headers
>  Content-Type: text/plain; charset="ansi"
>  Content-Transfer-Encoding: 7bit
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@14945
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.Tika.parseToString(Tika.java:380)
>         at de.uplanet.lucy.server.docplug.tika.TikaDocPlug.prepare(Unknown Source)
>         at de.uplanet.lucy.server.lucene.directory.DirectoryIndexManager.b(Unknown Source)
>         at de.uplanet.lucy.server.lucene.directory.DirectoryIndexManager.a(Unknown Source)
>         at de.uplanet.lucy.server.lucene.directory.DirectoryIndexManager.a(Unknown Source)
>         at de.uplanet.lucy.server.lucene.directory.DirectoryIndexManager.createIndex(Unknown Source)
>         at de.uplanet.lucy.server.lucene.directory.LuceneDirectoryJob.doWork(Unknown Source)
>         at de.uplanet.lucy.server.scheduler.AbstractJob.execute(Unknown Source)
>         at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>         at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
> Caused by: java.lang.RuntimeException: Encoding not found - ansi
>         at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:155)
>         at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:86)
>         at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
>         at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:421)
>         at org.apache.poi.hsmf.MAPIMessage.guess7BitEncoding(MAPIMessage.java:380)
>         at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:80)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:210)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         ... 12 more
> Caused by: java.io.UnsupportedEncodingException: ansi
>         at java.lang.StringCoding.decode(StringCoding.java:170)
>         at java.lang.String.<init>(String.java:443)
>         at java.lang.String.<init>(String.java:515)
>         at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:153)
>         ... 20 more

