tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Fri, 19 Aug 2011 15:21:32 GMT
Can you post some example docs that don't extract correctly?

Or, better, open a Jira issue(s) and attach the documents there?

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Fri, Aug 19, 2011 at 7:49 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
> I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word
> documents or PDF files, the text is not formatted properly.
>
> 1. There are no line breaks between sentences. The text is extracted as a
> single line or string.
>
> 2. Wherever there are boxes in Word documents, some weird characters appear
> in their place.
>
> How do I keep the formatting of the text just as it is in the document? For
> example, if there are 3 line breaks, how do I maintain them?
>
> Also, ? characters appear in the text when uploading Word documents. Where is
> the issue?
>
> Thanks
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3267810.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>
