lucene-solr-user mailing list archives

From Sascha Szott <sz...@zib.de>
Subject Re: Indexing file content with custom field
Date Wed, 02 Dec 2009 19:14:51 GMT
Piero,

it sounds like you're looking for an integration of Solr Cell and Solr's DIH 
facility -- a feature that isn't implemented yet (but the issue is 
already tracked in SOLR-1358).

As a workaround, you could store the extracted contents in plain text 
files (either by using Solr Cell or by using Apache Tika directly, which 
Solr Cell uses under the hood). Afterwards, you could use DIH's 
XPathEntityProcessor (to read the metadata in your XML files) in 
conjunction with DIH's PlainTextEntityProcessor (to read the previously 
created text files).

Another workaround would be to pass the metadata content as literal 
parameters along with the /update/extract request, as described in [1]. 
This would require you to write a small program that constructs and 
sends appropriate POST requests by parsing your XML metadata files.
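
A minimal sketch of such a program follows, in Python for brevity (the 
same request could equally be built with a Java HTTP library). It assumes 
that each metadata file uses <field name="..."> elements as in your 
example, that Solr is reachable at a base URL such as 
http://localhost:8983/solr, and that the raw file is sent as the request 
body. The function names literal_params, extract_url and post_file are 
made up for illustration.

```python
import mimetypes
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def literal_params(metadata_xml):
    """Map every non-empty <field name="..."> element to a literal.<name> parameter."""
    root = ET.fromstring(metadata_xml)
    params = []
    for field in root.iter("field"):
        name = field.get("name")
        value = (field.text or "").strip()
        if name and value:
            params.append(("literal." + name, value))
    return params


def extract_url(base_url, metadata_xml, commit=True):
    """Build the /update/extract URL carrying the metadata as literal parameters."""
    params = literal_params(metadata_xml)
    if commit:
        params.append(("commit", "true"))
    return base_url + "/update/extract?" + urlencode(params)


def post_file(url, path):
    """POST the raw file to Solr; the content type is guessed from the file name (assumption)."""
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            url,
            data=fh.read(),
            headers={"Content-Type": content_type},
            method="POST",
        )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For each document you would read its metadata file, call extract_url() 
with the XML, and pass the resulting URL together with the path of the 
doc/pdf file to post_file(). Which field receives the extracted text 
depends on how the extract handler is configured in your solrconfig.xml.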

Best,
Sascha

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals

Rodolico Piero wrote:
> Hi,
> 
> I need to index the contents of a file (doc, pdf, etc.) together with a
> set of custom metadata specified in XML, as in a standard update request
> to Solr. From the documentation, I can extract the contents of a file with
> an "/update/extract" request (Tika) and index the metadata with a second
> "/update" request by passing the XML. How can I do both in a single
> request (without curl, using a Java HTTP library or SolrJ)? For
> example (although I know this is not correct):
> 
> <add>
>   <doc>
>     <field name="id"> </field>
>     <field name="myfield-1"> </field>
>     <field name="myfield-n"> </field>
>     <field name="content"> content of the extracted file (text) </field>
>   </doc>
> </add>
> 
> That way I could search either by metadata or by full text on the content.
> Sorry for my English ...
> 
> Thanks a lot.
> 
> Piero


