lucene-solr-user mailing list archives

From <Markus.Rietz...@rzf.fin-nrw.de>
Subject solr cell/tika: pdf import with xml metatags
Date Tue, 27 Oct 2009 10:36:16 GMT
Hi,

We want to use Solr as our intranet search engine.
I downloaded the nightly build of Solr 1.4. PDF extraction works via Solr Cell/Tika, and I can
send a PDF to Solr via curl.

We have a large set of meta tags for all our intranet documents, including PDF, PPT, etc.
When importing HTML files from our CMS, I have access to all of these meta tags and create an
XML document which I send to Solr, e.g.:

<?xml version='1.0' encoding='UTF-8'?>
<add>
<doc>
<field name="id">1</field>
<field name="title">this is the title</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">this is another title</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title">this is the third title</field>
</doc>
</add>
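
For illustration, here is a minimal sketch of how an update message like the one above could
be generated from the CMS metadata. It assumes Python and its standard xml.etree.ElementTree
module; the function name build_add_xml and the sample field values are hypothetical and only
mirror the example, they are not taken from the actual import script.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # One <field name="..."> element per meta tag; ElementTree escapes the text.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical metadata as it might come out of the CMS.
docs = [
    {"id": 1, "title": "this is the title"},
    {"id": 2, "title": "this is another title"},
]
xml_message = build_add_xml(docs)
```

The resulting string would then be posted to Solr's update handler in the same way as the
hand-written XML.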

This works fine with HTML files, where I can grab all the meta tags, including "body".

So my question is: can I use this XML document to send a PDF file as well? One way would be
to call the extract handler in extract-only mode and put the extracted data in the "body" field.
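
A rough sketch of that extract-only approach, under several assumptions: the extract handler
is mapped to /update/extract (as in the example solrconfig.xml), Solr runs at the hypothetical
address http://localhost:8983/solr, and the schema has a "body" field. The helper names
extract_only_url and doc_with_body are made up for illustration. Uploading the PDF and pulling
the text out of the handler's response are left out, since the response layout depends on the
Solr version.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SOLR_URL = "http://localhost:8983/solr"  # assumption: local default install


def extract_only_url(solr_url=SOLR_URL):
    """URL that asks Solr Cell to return the extracted content without indexing it."""
    params = {"extractOnly": "true"}
    return f"{solr_url}/update/extract?{urlencode(params)}"


def doc_with_body(metadata, extracted_text):
    """Combine the CMS meta tags with the extracted PDF text in one <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "body" is the assumed name of the full-text field in the schema.
    for name, value in {**metadata, "body": extracted_text}.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


# Step 1: POST the PDF to extract_only_url() (for example with curl) and take
#         the extracted text from the response.
# Step 2: POST the message built below to the regular update handler.
message = doc_with_body({"id": 4, "title": "a pdf document"}, "text extracted from the pdf")
```

The drawback is that every PDF needs two requests: one to extract the text and one to index
it together with the meta tags.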

Is there any other way?

--
Kind regards

Markus Rietzler - <rietzler_software/>
Rechenzentrum der Finanzverwaltung NRW
0211/4572-2130

