lucene-solr-user mailing list archives

From Sascha Szott <sz...@zib.de>
Subject Re: How to use DataImportHandler with ExtractingRequestHandler?
Date Thu, 03 Sep 2009 17:49:43 GMT
Hi Khai,

A few weeks ago I faced the same problem.

In my case, the following workaround helped (assuming you're using Solr 1.3):
for each row, extract the content of the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), and put it between

	<foo><![CDATA[

and

	]]></foo>

and store the result in a file. To keep the relationship between a file and
its corresponding database row, use the primary key as the file name (with
an .xml extension, so that it matches the url attribute used in
data-config.xml).
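
The wrapping step can be sketched in Python as follows. This is only an
illustration under a few assumptions: extract_text is a hypothetical callable
standing in for whatever parser you use (PDFBox, Tika, ...), and the function
names are made up for this example. It also handles two details that are easy
to miss: a CDATA section must not contain the sequence "]]>", and extracted
PDF text often contains control characters (such as form feeds) that are not
allowed in XML 1.0.

```python
import re
from pathlib import Path
from typing import Callable, Union

# Characters that are NOT allowed in an XML 1.0 document.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def wrap_in_cdata(text: str, tag: str = "foo") -> str:
    """Return text wrapped as <tag><![CDATA[...]]></tag>."""
    # Drop characters an XML parser would reject (e.g. form feeds).
    safe = _INVALID_XML_CHARS.sub("", text)
    # "]]>" would end the CDATA section early, so split it across
    # two adjacent CDATA sections.
    safe = safe.replace("]]>", "]]]]><![CDATA[>")
    return f"<{tag}><![CDATA[{safe}]]></{tag}>"


def write_row_file(
    primary_key: Union[int, str],
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical: your PDF parser
    out_dir: str,
) -> Path:
    """Extract the text of one PDF and store it as <primary_key>.xml."""
    out_path = Path(out_dir) / f"{primary_key}.xml"
    out_path.write_text(wrap_in_cdata(extract_text(pdf_path)), encoding="utf-8")
    return out_path
```

You would call write_row_file once per database row, passing the row's
primary key and the PDF location stored in that row.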

In data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow with the name of your parent entity and primaryKey with its primary
key column):

<entity name="pdfcontent"
	processor="XPathEntityProcessor"
	forEach="/foo"
	url="${dbRow.primaryKey}.xml">
   <field column="pdftext" xpath="/foo"/>
</entity>


By the way, in Solr 1.4 you do not have to wrap your content in XML tags:
use the PlainTextEntityProcessor instead of the XPathEntityProcessor.

Best,
Sascha

Khai Doan wrote:
> Hi all,
> 
> My name is Khai.  I have a table in a relational database.  I have
> successfully used DataImportHandler to import this data into Apache Solr.
> However, one of the columns stores the location of a PDF file.  How can I
> configure DataImportHandler to use ExtractingRequestHandler to extract the
> content of the PDF?
> 
> Thanks!
> 
> Khai Doan
> 

