lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: solr cell/tika: pdf import with xml metatags
Date Tue, 27 Oct 2009 10:43:24 GMT

On Oct 27, 2009, at 6:36 AM, <Markus.Rietzler@rzf.fin-nrw.de> wrote:

> hi,
>
> we want to use SOLR as our intranet search engine.
> i downloaded the nightly build of solr 1.4. pdf extraction is done via  
> Solr Cell/Tika. i can send the pdf via curl
> to solr.
>
> we do have a large set of meta-tags for all our intranet documents,  
> including PDF, PPT etc. to import html
> files from our CMS i have access to all of these meta tags and create  
> an xml document which i send to SOLR,
>
> eg.
>
> <?xml version='1.0' encoding='UTF-8'?>
> <add>
> <doc>
> <field name="id">1</field>
> <field name="title">this is the title</field>
> </doc>
> <doc>
> <field name="id">2</field>
> <field name="title">this is another title</field>
> </doc>
> <doc>
> <field name="id">3</field>
> <field name="title">this is the third title</field>
> </doc>
> </add>
>
> this works fine with html files where i can grab all the meta tags,  
> including "body".
>
> so my question is, can i use this xml-document to send a pdf file  
> also?

I'm not sure what you mean here, can you clarify?  PDF and other  
"rich" documents can't be sent by XML.
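
One way to keep the CMS metadata next to a PDF is to pass it as literal.<fieldname> request parameters to the extracting handler while the PDF itself travels as the raw request body. Below is a minimal sketch in Python, not a definitive implementation. The host, port and handler path (localhost:8983, /solr/update/extract) are assumptions based on the default example setup, and the helper names build_extract_url and post_pdf are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: default example Solr instance with the extracting handler
# mapped to /update/extract. Adjust for your own installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(base_url, metadata, commit=True):
    """Turn a dict of metadata into literal.<field>=<value> parameters."""
    params = [("literal." + name, value) for name, value in metadata.items()]
    if commit:
        params.append(("commit", "true"))
    return base_url + "?" + urlencode(params)


def post_pdf(pdf_path, metadata, base_url=SOLR_EXTRACT_URL):
    """POST the PDF bytes as the request body; metadata goes in the URL."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    request = Request(
        build_extract_url(base_url, metadata),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(request) as response:
        return response.status
```

A call such as post_pdf("report.pdf", {"id": "4", "title": "this is a pdf title"}) would then index the extracted text together with the two metadata fields, provided those fields exist in the schema.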

> ok, one way would be to use
> the extracthandler with extractOnly and put the data in the "body"  
> field.

I guess all I can point you at right now is the wiki:  http://wiki.apache.org/solr/ExtractingRequestHandler
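
The extractOnly route mentioned above could look roughly like this: call the handler with extractOnly=true to get the text back without indexing it, then add that text as a "body" field to the same kind of XML document used for the HTML files. The sketch below covers only the second half and assumes the extracted text is already available as a string. Parsing the extractOnly response is left out, and build_add_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field_name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes <, > and & in the text when serializing.
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Assumption: extracted_text came from an earlier extractOnly=true request.
extracted_text = "text pulled out of the pdf"
message = build_add_xml(
    [{"id": "4", "title": "this is a pdf title", "body": extracted_text}]
)
```

The resulting string can be posted to the regular update handler in the same way as the XML for the HTML files. The trade-off is two requests per PDF instead of one.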

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

