tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
Date Wed, 16 Dec 2015 16:32:46 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060254#comment-15060254 ]

Tim Allison commented on TIKA-1813:
-----------------------------------

Duh... I initially posted the exceptions on the theory that we might be misreading, for an
old version of the format, how many bytes to read, but yes, truncated makes sense.

I'll post some other tika-msoffice files that didn't cause exceptions.  Thank you for the tip on
the header dumper.

> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
>                 Key: TIKA-1813
>                 URL: https://issues.apache.org/jira/browse/TIKA-1813
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 25JIANLV77U645GUSJ2E67YSM4B2TNSP,
>                      27BYDLE36XWCDZXA3PPV6MF524UQ6KAF, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current
> slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
>     at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file types and
> patching our OLE mime detector would be great.
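
A rough way to test the truncation theory against an exception like the one quoted above
is to compare the file length with what the OLE2 header itself claims. The sketch below is
not Tika or POI code; it uses only the JDK, and the class and method names
(OleTruncationCheck, firstDirectorySectorEnd, looksTruncated, check) are made up for
illustration. It assumes the standard compound file header layout (sector shift at byte
offset 30, first directory sector at byte offset 48, both little-endian) and only checks
whether the first directory sector fits inside the file, so a file that passes could still
be truncated further along.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

public class OleTruncationCheck {
    // OLE2 / Compound File Binary signature bytes.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Fixed header fields end at offset 76, where the DIFAT array begins.
    static final int HEADER_BYTES = 76;

    static boolean hasOleMagic(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Byte offset just past the first directory sector, as claimed by the header. */
    static long firstDirectorySectorEnd(byte[] header) {
        if (header.length < HEADER_BYTES || !hasOleMagic(header)) {
            throw new IllegalArgumentException("not an OLE2 header");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int sectorShift = buf.getShort(30) & 0xFFFF; // 9 -> 512 bytes, 12 -> 4096 bytes
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("unexpected sector shift: " + sectorShift);
        }
        long sectorSize = 1L << sectorShift;
        long firstDirSector = buf.getInt(48) & 0xFFFFFFFFL; // unsigned 32-bit
        // Sector n starts at (n + 1) * sectorSize, since the header occupies the
        // first sector-sized block; add one more sector to get the end offset.
        return (firstDirSector + 2) * sectorSize;
    }

    /** True if the first directory sector would extend past the end of the file. */
    static boolean looksTruncated(byte[] header, long fileLength) {
        return firstDirectorySectorEnd(header) > fileLength;
    }

    /** Convenience wrapper that reads the header from disk. */
    static boolean check(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = in.readNBytes(HEADER_BYTES);
            return looksTruncated(header, Files.size(file));
        }
    }
}
```

Calling OleTruncationCheck.check(path) on one of the attachments would return true if the
header points at a directory sector beyond the end of the file, which is consistent with
(though not proof of) a truncated download rather than an unknown OLE format.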



