lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: PDF Indexing
Date Thu, 03 Apr 2014 04:09:49 GMT
I see that the PDFBox library (which is what Tika uses for PDF files) has 
methods to manipulate individual pages:
http://stackoverflow.com/questions/6839787/reading-a-particular-page-from-a-pdf-document-using-pdfbox
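
As a rough sketch of that page-by-page idea: PDFBox's PDFTextStripper takes a
1-based, inclusive page range through setStartPage and setEndPage, so setting
both to the same number should extract a single page. The loop below hides the
PDFBox call behind a hypothetical PageReader interface (my own name, not part
of PDFBox) so that it compiles without the library. The comment shows where the
stripper would plug in; the package names and the load call depend on the
PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitter {

    /** Hypothetical abstraction over a PDF library; not a PDFBox type. */
    public interface PageReader {
        int pageCount();
        String textOfPage(int pageNumber); // 1-based page number
    }

    /**
     * Collects the text of every page, one list entry per page.
     * With PDFBox, textOfPage(n) would roughly be:
     *   stripper.setStartPage(n);
     *   stripper.setEndPage(n);
     *   return stripper.getText(document);
     * and pageCount() would be document.getNumberOfPages().
     */
    public static List<String> splitIntoPages(PageReader reader) {
        List<String> pages = new ArrayList<>();
        for (int n = 1; n <= reader.pageCount(); n++) {
            pages.add(reader.textOfPage(n));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub reader standing in for a real PDF, for illustration only.
        final String[] fake = {"first page", "second page"};
        PageReader stub = new PageReader() {
            @Override public int pageCount() { return fake.length; }
            @Override public String textOfPage(int n) { return fake[n - 1]; }
        };
        System.out.println(splitIntoPages(stub));
    }
}
```

Each entry in the returned list can then be indexed as its own document.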

-- Jack Krupansky

-----Original Message----- 
From: Ahmet Arslan
Sent: Wednesday, April 2, 2014 3:35 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF Indexing

Hi Sujatha,

There is no built-in mechanism. Prepare the per-page documents outside of Solr, for example with SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/


You may also want to save the extracted text somewhere. If you change anything
in the index analysis or schema, you will need to reindex, and with the text
saved you can at least skip the extraction phase.
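
A minimal sketch of both points, assuming the page texts have already been
extracted (for instance with PDFBox). It builds one field map per page and
writes each page's text to disk so that a later reindex can reuse it. The class
name, the field names (id, source_id, page, text) and the "_p" id suffix are
illustrative assumptions, not anything Solr requires, and the PDF id is assumed
to be safe to use in a file name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageDocs {

    /** Builds one field map per page; field names are assumptions. */
    public static List<Map<String, Object>> build(String pdfId, List<String> pageTexts) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int page = i + 1; // 1-based page number
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", pdfId + "_p" + page); // unique key per page
            doc.put("source_id", pdfId);        // ties pages back to the PDF
            doc.put("page", page);
            doc.put("text", pageTexts.get(i));
            docs.add(doc);
        }
        return docs;
    }

    /** Saves each page's text as <id>.txt so a reindex can skip extraction. */
    public static void saveTexts(Path dir, List<Map<String, Object>> docs) throws IOException {
        Files.createDirectories(dir);
        for (Map<String, Object> doc : docs) {
            byte[] bytes = ((String) doc.get("text")).getBytes(StandardCharsets.UTF_8);
            Files.write(dir.resolve(doc.get("id") + ".txt"), bytes);
        }
    }

    /** Reads a saved page text back from the cache directory. */
    public static String loadText(Path dir, String id) throws IOException {
        return new String(Files.readAllBytes(dir.resolve(id + ".txt")), StandardCharsets.UTF_8);
    }
}
```

With SolrJ, each map would become a SolrInputDocument (one addField call per
entry) that is sent to Solr and committed, as in the article linked above.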


Ahmet



On Wednesday, April 2, 2014 10:05 PM, Sujatha Arun <suja.arun@gmail.com> 
wrote:
Hi,

I am able to use Tika and DIH to index a PDF as a single document. However,
I need each page to be a separate document. Is there any built-in mechanism to
achieve this, or do I have to use PDFBox or another tool?

Regards 

