lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: AW: solr cell/tika: pdf import with xml metatags
Date Tue, 27 Oct 2009 12:16:48 GMT
You can send PDF files via SolrJ:  http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/

I'm sure the various other clients could do the same thing.  All you  
really need is a way to upload the files.
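As a rough sketch of such an upload from a client other than SolrJ, the Python below passes the metadata as `literal.<field>` request parameters (described on the ExtractingRequestHandler wiki page) and posts the raw PDF bytes. The host, port, and `/update/extract` path are assumptions about a default local install, and `build_extract_url` and `post_pdf` are hypothetical helper names, not part of any Solr client.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Solr with the extracting handler mapped to /update/extract.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(metadata, base_url=SOLR_EXTRACT_URL, commit=True):
    """Return the extract URL with each metadata entry sent as literal.<field>."""
    params = [("literal." + name, value) for name, value in metadata.items()]
    if commit:
        params.append(("commit", "true"))
    return base_url + "?" + urlencode(params)


def post_pdf(path, metadata):
    """POST the raw PDF bytes to Solr; only succeeds against a running Solr."""
    with open(path, "rb") as handle:
        request = Request(
            build_extract_url(metadata),
            data=handle.read(),
            headers={"Content-Type": "application/pdf"},
        )
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the URL; no network access is needed for this part.
    print(build_extract_url({"id": "1", "title": "this is the title"}))
```

The URL builder is kept separate from the upload so the parameter encoding can be checked without a server.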

Still, sending lots of rich docs over the wire isn't always the best
way, either.  You may want to write your own client-side code that uses
Tika to extract the text locally and sends only that to Solr.
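If the text has already been extracted on the client, it could be combined with the CMS metadata in the same `<add>` XML format used for the HTML files. The sketch below assumes the extraction has happened elsewhere (the `extracted_text` value is a placeholder) and that the schema has a `body` field. `build_add_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Placeholder for whatever the client-side extraction step returned.
    extracted_text = "text pulled out of the PDF"
    print(build_add_xml([
        {"id": "1", "title": "this is the title", "body": extracted_text},
    ]))
```

Building the message with an XML library rather than string concatenation avoids broken documents when the extracted text contains characters such as `&` or `<`.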

-Grant

On Oct 27, 2009, at 6:49 AM, <Markus.Rietzler@rzf.fin-nrw.de> wrote:

> thanks,
> i know that page and have read it. sending additional meta tags with
> the curl call is no problem. i only thought that there might be a way
> to use the xml approach with PDF files as well. i'll go the "curl" way
> for those files.
>
> --
> Kind regards
>
> Markus Rietzler - <rietzler_software/>
> Rechenzentrum der Finanzverwaltung NRW
> 0211/4572-2130
>
>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Tuesday, 27 October 2009 11:43
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr cell/tika: pdf import with xml metatags
>>
>>
>> On Oct 27, 2009, at 6:36 AM, <Markus.Rietzler@rzf.fin-nrw.de> wrote:
>>
>>> hi,
>>>
>>> we want to use SOLR as our intranet search engine.
>>> i downloaded the nightly build of solr 1.4. pdf extraction is done via
>>> Solr Cell/Tika. i can send the pdf to solr via curl.
>>>
>>> we have a large set of meta tags for all our intranet documents,
>>> including PDF, PPT etc. to import html files from our CMS, i have
>>> access to all of these meta tags and create an xml document which i
>>> send to SOLR,
>>>
>>> eg.
>>>
>>> <?xml version='1.0' encoding='UTF-8'?>
>>> <add>
>>> <doc>
>>> <field name="id">1</field>
>>> <field name="title">this is the title</field>
>>> </doc>
>>> <doc>
>>> <field name="id">2</field>
>>> <field name="title">this is another title</field>
>>> </doc>
>>> <doc>
>>> <field name="id">3</field>
>>> <field name="title">this is the third title</field>
>>> </doc>
>>> </add>
>>>
>>> this works fine with html files where i can grab all the meta tags,
>>> including "body".
>>>
>>> so my question is, can i use this xml-document to send a pdf file
>>> also?
>>
>> I'm not sure what you mean here, can you clarify?  PDF and other
>> "rich" documents can't be sent by XML.
>>
>>> ok, one way would be to use the extracthandler with extract only
>>> and put the data in the "body"-field.
>>
>> I guess all I can point you at right now is the wiki:
>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
