tika-dev mailing list archives

From nirnaydewan <nirnayde...@gmail.com>
Subject Issue in text extraction in Solr / Tika
Date Fri, 19 Aug 2011 11:49:48 GMT
I am using Solr 3.3.0 with the bundled Jetty server. When I upload MS Word
documents or PDF files, the extracted text is not formatted properly.

1. There are no line breaks between sentences. The text is extracted as a
single line (one string).

2. Wherever there are boxes in Word documents, some weird characters appear
in their place.

How do I keep the formatting of the text as it is in the document? For example,
if there are 3 line breaks, how do I preserve them?

Also, "?" characters appear in the text when uploading Word documents. What is
causing this?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3267810.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
