lucene-dev mailing list archives

From "Aditya" <>
Subject RE: Problem with PDF extraction
Date Tue, 27 Apr 2010 07:34:29 GMT
I too faced a similar problem.


May I suggest trying pdftotext? I have observed it being used by Google.
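
As a rough sketch of that suggestion, pdftotext can be called from a script to pull the text out of a PDF before indexing it. This assumes the pdftotext binary (from Xpdf or Poppler) is installed and on the PATH; the function names and the example file name are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list for pdftotext.

    "-enc UTF-8" asks for UTF-8 output, and the trailing "-" sends the
    extracted text to stdout instead of writing a .txt file.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if the tool exits with an error.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "report.pdf" is a placeholder file name.
    print(extract_text("report.pdf"))
```

The returned text could then be sent to Solr as an ordinary field value, which sidesteps the Tika parsing step altogether.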




Best Regards,



From: Grant Ingersoll [] On Behalf Of Grant Ingersoll
Sent: Tuesday, April 27, 2010 3:38 AM
Subject: Re: Problem with PDF extraction


Hi Marc,


Can you ask on and give more information about
any errors that occur in your Solr log plus the setup of the
ExtractingRequestHandler and related schema?




On Apr 26, 2010, at 5:04 PM, Marc Ghorayeb wrote:



I have been having problems with PDFs randomly crashing the Solr 1.4 server,
so I tried out the SVN version, which contains a newer Tika library. On its
own, the Tika app correctly extracts the content of my PDF. However, inside
Solr, when I upload a PDF file to my update/extract handler, it does not
seem to parse it (a blank file is output). The literal values do get
indexed, though. I have had no luck getting the Tika parsing to work. For
some reason, I get the same result whether or not tika-parsers-0.7.jar
is present in the lib folder, whereas if the tika-core-0.7 jar is absent, it
just crashes (which seems normal to me).
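
One way to narrow a problem like this down is to call the extract handler with extractOnly=true, which returns what Tika extracted without indexing it, so an empty result shows up directly in the response. The following is only a minimal sketch: it assumes Solr is reachable at http://localhost:8983/solr and that the schema has an id field, and the function names, document id and file name are hypothetical.

```python
import urllib.parse
import urllib.request


def extract_url(base_url, literals=None, extract_only=False):
    """Build a URL for the update/extract handler.

    Each entry in `literals` becomes a literal.<field> parameter.
    With extract_only=True the extracted content is returned and
    nothing is indexed; otherwise the document is indexed and committed.
    """
    params = []
    for field, value in sorted((literals or {}).items()):
        params.append(("literal." + field, value))
    if extract_only:
        params.append(("extractOnly", "true"))
    else:
        params.append(("commit", "true"))
    query = urllib.parse.urlencode(params)
    return base_url.rstrip("/") + "/update/extract?" + query


def post_pdf(url, pdf_path):
    """POST the raw PDF bytes to the handler and return the response body."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumed local Solr instance and placeholder document id / file name.
    url = extract_url(
        "http://localhost:8983/solr", {"id": "doc1"}, extract_only=True
    )
    print(post_pdf(url, "sample.pdf"))
```

If the extractOnly response is also empty, the parser jars are probably not being loaded by Solr, which would point at the classpath rather than at the schema.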


I don't seem to be the only one having this problem (on the user mailing
list that is). Can anyone help me out? It would be greatly appreciated.


I use a fairly classic schema and the default request handlers.


Marc Ghorayeb.






Grant Ingersoll



