tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1138) I got empty body and empty title with some documents
Date Mon, 24 Jun 2013 12:30:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691929#comment-13691929 ]

Nick Burch commented on TIKA-1138:
----------------------------------

That's often a sign that the parser can't handle the files. There's some discussion on the
dev list at the moment about how best to report that, but it hasn't reached a conclusion yet.

As an example, solupro.xls is an Excel 95 file, which Apache POI (the library Tika uses for
.xls) doesn't handle. That is why you're able to get metadata but not text.
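
Since nothing is logged in this case, a caller may want to flag the "empty body, empty
title" result on their side. The following is only a minimal sketch, assuming the XHTML
output (the "Structured Text" view) is already available as a string. The class and method
names (EmptyExtractionCheck, hasEmptyBody, hasEmptyTitle) are hypothetical, and the code
uses the JDK's XML parser only, not any Tika API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): inspects XHTML output for empty <body>/<title>.
public class EmptyExtractionCheck {

    /** True when no non-whitespace text exists inside any <body> element. */
    public static boolean hasEmptyBody(String xhtml) throws Exception {
        return textOf(xhtml, "body").isEmpty();
    }

    /** True when no non-whitespace text exists inside any <title> element. */
    public static boolean hasEmptyTitle(String xhtml) throws Exception {
        return textOf(xhtml, "title").isEmpty();
    }

    // Collects the trimmed text content of all elements with the given local name,
    // regardless of namespace (the output in the report uses the XHTML namespace).
    private static String textOf(String xhtml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = doc.getElementsByTagNameNS("*", localName);
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            text.append(nodes.item(i).getTextContent());
        }
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample modelled on the report; the elided head content is left out.
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title/></head><body/></html>";
        if (hasEmptyBody(sample) && hasEmptyTitle(sample)) {
            System.out.println("No text extracted; the format may be unsupported.");
        }
    }
}
```

An empty body does not prove the format is unsupported (a document can simply contain no
text), so a check like this is best treated as a hint to look at the detected file type.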
                
> I got empty body and empty title with some documents
> ----------------------------------------------------
>
>                 Key: TIKA-1138
>                 URL: https://issues.apache.org/jira/browse/TIKA-1138
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.3
>         Environment: Windows 7 (my desktop)
>            Reporter: Koutsoulis Philippe
>              Labels: test
>
> *+Tested version:+* Apache Tika 1.3 (with the Apache Tika GUI)
> Hi all,
> I get an empty body and an empty title with some documents.
> Do you have any idea why?
> *+Extract from my "Structured Text"+*
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> ...
> <title/>
> </head>
> <body/></html>
> {noformat}
> *+Files to reproduce+*
> [http://www.justice.gouv.fr/art_pix/declaration_sexe_20091016.xls]
> [http://ge.ch/ssco_gestats/excel/deinfo_par_ht2004.xls]
> [http://homepage.swissonline.ch/ccvaf1/stock_divers/palmares_ccvaf.xls]
> [http://top1000.anthologeek.net/participants.current.txt]
> [http://ge.ch/ssco_gestats/excel/refona_par_ht2006.xls]
> [http://www.rad.fr/solupro.xls]
> [http://www.pfynschiessen.ch/TClassementgroupeinvite.xls]
> [http://www.gregdonner.org/workbench/wb_31rev.txt]
> (i) No error in logs :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira