tika-dev mailing list archives

From Staffan <sols...@gmail.com>
Subject Re: Tika Snapshot Fails on PDF Articles
Date Fri, 10 Dec 2010 20:00:13 GMT
2010/12/10 Michael Schmitz <michael@schmitztech.com>:
> Hi,
>
> I don't think the current snapshot is parsing articles (PDFs with
> columns/beads) correctly.  The text is not in the right order, as it
> intermixes text from different beads.  Try it on an academic paper.
>
> http://turing.cs.washington.edu/papers/acl08.pdf
>
> Tika App 0.8 parses the text in the right order but omits spaces.  PDFBox
> 1.3.1 parses the file wonderfully.  I attached the output of parsing the
> PDF with each utility.
>
> Peace.  Michael
>
>

Could be related to https://issues.apache.org/jira/browse/TIKA-548
