lucene-solr-user mailing list archives

From "Jack Krupansky" <>
Subject Re: PDF Indexing
Date Thu, 03 Apr 2014 04:09:49 GMT
I see that the PDFBox library (which is what Tika uses for PDF files) has 
methods to manipulate individual pages:

-- Jack Krupansky

-----Original Message----- 
From: Ahmet Arslan
Sent: Wednesday, April 2, 2014 3:35 PM
Subject: Re: PDF Indexing

Hi Sujatha,

There is no built-in mechanism. Prepare the page documents outside of Solr.
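
As a rough sketch of what "page documents" could look like, assuming the
per-page text has already been extracted by some external tool: the function
names and the field names (id, parent_id, page, content) below are
hypothetical and would need to match your own schema. The JSON array is the
kind of body that can be posted to Solr's JSON update handler.

```python
import json


def build_page_docs(doc_id, page_texts, title=None):
    """Return one Solr-style document (a dict) per page of text.

    Field names here are illustrative assumptions, not a fixed schema.
    """
    docs = []
    for page_no, text in enumerate(page_texts, start=1):
        doc = {
            "id": f"{doc_id}_p{page_no}",  # unique key per page
            "parent_id": doc_id,           # lets you group pages by PDF
            "page": page_no,
            "content": text,
        }
        if title is not None:
            doc["title"] = title
        docs.append(doc)
    return docs


def to_update_payload(docs):
    """Serialize the documents as a JSON array for an update request."""
    return json.dumps(docs, ensure_ascii=False)


if __name__ == "__main__":
    pages = ["Text of page one.", "Text of page two."]
    print(to_update_payload(build_page_docs("report42", pages)))
```

Keeping the source PDF's identifier in its own field makes it possible to
delete or replace all pages of one PDF with a single query.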

You may also want to save the extracted text somewhere. If you change anything
in the index analysis or schema, you will need to reindex. If you have saved
the text, you can at least skip the extraction phase.
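
One possible way to keep the extracted text around, sketched here with one
JSON file per PDF on local disk. The cache layout and the function names are
assumptions for illustration; a database or any other store would work just
as well.

```python
import json
from pathlib import Path


def save_page_texts(cache_dir, doc_id, page_texts):
    """Write the list of page texts for one PDF to <cache_dir>/<doc_id>.json."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{doc_id}.json"
    path.write_text(json.dumps(page_texts, ensure_ascii=False), encoding="utf-8")
    return path


def load_page_texts(cache_dir, doc_id):
    """Return the cached page texts, or None if this PDF was never extracted."""
    path = Path(cache_dir) / f"{doc_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

On a reindex, check load_page_texts first and run the PDF extraction only
when it returns None.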


On Wednesday, April 2, 2014 10:05 PM, Sujatha Arun <> 

I am able to use Tika and DIH to index a PDF as a single document. However,
I need each page to be a single document. Is there any built-in mechanism to
achieve this, or do I have to use PDFBox or some other tool?

