lucene-java-user mailing list archives

From Lance Norskog
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 07:00:19 GMT
The text should come out as a stream of words separated by spaces, but
without any of the formatting in the PDF. Extraction is only good enough
to tell you that a word is somewhere inside a PDF file. Can you post a
short bit of the text that it extracted?

Also, you should try this test on different PDF files that were made
with different software.
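
When comparing several PDFs, a quick check on the extracted string can
show whether the words really are run together. The sketch below is only
a heuristic and does not use Tika at all: it takes text you have already
extracted and reports the share of whitespace and the longest token. The
class name, the method names, and the threshold of 30 characters are all
assumptions made for illustration.

```java
class WhitespaceCheck {
    /** Fraction of characters that are whitespace; 0.0 for empty input. */
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long ws = text.chars().filter(Character::isWhitespace).count();
        return (double) ws / text.length();
    }

    /** Length of the longest run of non-whitespace characters. */
    static int longestToken(String text) {
        int longest = 0;
        for (String token : text.trim().split("\\s+")) {
            longest = Math.max(longest, token.length());
        }
        return longest;
    }

    /**
     * Heuristic: a token longer than maxTokenLength suggests that spaces
     * were lost during extraction. The threshold is an assumption, and
     * long URLs or identifiers can trigger false positives.
     */
    static boolean looksRunTogether(String text, int maxTokenLength) {
        return longestToken(text) > maxTokenLength;
    }

    public static void main(String[] args) {
        // Stand-ins for text extracted from two different PDF files.
        String[] samples = {
            "The quick brown fox jumps over the lazy dog",
            "Thequickbrownfoxjumpsoverthelazydog"
        };
        for (String s : samples) {
            System.out.printf("ratio=%.2f longest=%d suspicious=%b%n",
                    whitespaceRatio(s), longestToken(s), looksRunTogether(s, 30));
        }
    }
}
```

If only some files are flagged, the problem is more likely in how those
PDFs were produced than in the indexing step.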

On Thu, Dec 2, 2010 at 9:35 PM, Ganesh wrote:
> Hello all,
> I know this is not the right group for this question, but I thought some of
> you might have run into it.
> I am a newbie with Tika, using the latest version, 0.8. I extracted text from
> a PDF document but found the spaces and newlines missing. Indexing the data
> gives wrong results. Could anyone in this group help me? I am using Tika
> directly to extract the contents, which later get indexed.
> Regards
> Ganesh

Lance Norskog
