lucene-solr-user mailing list archives

From Chris Hostetter <>
Subject Re: AW: What is the best way to index xml data preserving the mark up?
Date Fri, 09 Nov 2007 00:12:32 GMT

: Seems like Lucene/SOLR is geared to take record and non-xml-oriented 
: content and put it into XML format for ingest -- but really can't digest 
: XML content itself at all without significant setup and constraints.  I 
: am surprised -- but I could really use it for my project big time.

Lucene is geared towards indexing records containing key=>value 
pairs.  The values are then passed to "Analyzers" to break them up into 
individual terms.
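
As a rough illustration of that model (this is not Lucene's actual API; the names `simple_analyzer` and `analyze_record` are made up for this sketch, and a real Analyzer does much more), a record of key=>value pairs being broken up into terms might look like:

```python
import re

def simple_analyzer(value):
    """Toy analyzer: lowercase the value, then split on anything
    that is not an ASCII letter or digit, dropping empty pieces."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]

def analyze_record(record):
    """Run every field value of a key=>value record through the analyzer."""
    return {field: simple_analyzer(value) for field, value in record.items()}

# A hypothetical record; the field names are invented for the example.
record = {
    "title": "Indexing XML Data",
    "body": "Preserve the mark-up, please.",
}
terms = analyze_record(record)
```

Each field ends up with its own list of terms, which is the unit the index actually works with.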

Solr is geared towards providing a non-Java interface to accept those 
Documents and hand them off to Lucene, and to providing a simple way to 
define Analyzers using configuration without compiling custom Java code.  
A specific XML format is one way to communicate with Solr what those 
"records" are, CSV is another, ... other generic formats can be added as 

(Mind you -- Lucene and Solr are "geared" for a lot of things in addition 
to those, but for the purposes of this conversation, and the focus on 
indexing, those are the distinctions.)

The aspect of your situation that neither Solr nor Lucene 
really focuses on is extracting the key=>value pairs from a larger stream 
of text (ie: XML in a user defined schema).  This is where something like 
the XSLT approach I described could be helpful: you (as more of an expert 
on the XML schema of your documents than Solr) could write an XSLT for 
extracting the field=>value pairs for each doc, to give to Solr.

You could do the same thing client side before sending the data to Solr -- 
the Jira issue I referred to (SOLR-285 BTW) would just allow this transform 
to happen server side.
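
A minimal sketch of the client-side route, using plain Python instead of XSLT: it pulls field=>value pairs out of a document in a made-up schema (`article`/`head`/`body` are assumptions, as are the field names and the helpers `extract_fields` and `to_solr_add`) and wraps them in Solr's `<add><doc><field name="...">` update format.  Sending the result to Solr is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the schema is invented for this example.
SOURCE = """<article id="a1">
  <head><title>Indexing XML</title><author>J. Doe</author></head>
  <body><p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p></body>
</article>"""

def extract_fields(article):
    """Map the user-defined schema onto flat (field, value) pairs."""
    # Flatten each <p>, inline mark-up included, into plain text.
    paragraphs = ["".join(p.itertext()) for p in article.iter("p")]
    text = " ".join(" ".join(paragraphs).split())  # normalize whitespace
    return [
        ("id", article.get("id")),
        ("title", article.findtext("head/title")),
        ("author", article.findtext("head/author")),
        ("text", text),
    ]

def to_solr_add(field_pairs):
    """Wrap the pairs in an <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in field_pairs:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

pairs = extract_fields(ET.fromstring(SOURCE))
solr_xml = to_solr_add(pairs)
```

Note that the mark-up inside `body` is thrown away here; if it has to be preserved, the original XML could go into an additional field as an escaped string.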

