From Norberto Meijome <>
Subject Re: What is the best way to index xml data preserving the mark up?
Date Thu, 08 Nov 2007 06:20:11 GMT
On Wed, 7 Nov 2007 20:18:25 -0800 (PST)
David Neubert <> wrote:

> I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR.
> I have rich xml content (books) that needs to be searched at granular levels (specifically
paragraph and sentence levels, very accurately, no approximations).  My source text has exact
<p></p> and <s></s> tags for this purpose.  I have built this app
in previous versions (using other search engines) by indexing the text twice: (1) where every
paragraph was a virtual document and (2) where every sentence was a virtual document -- both
extracted from the source file (which was a single xml file for the entire book).  I have of
course thought about using an XML engine such as eXist or Xindice, but I prefer the stability,
user base, and performance that Lucene/SOLR seems to have, and there is also a large body
of text that is regular documents and not well-formed XML.
> I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml
scheme to add documents:
> <add>
>   <doc>
>     <field name="foo1">foo value 1</field>
>     <field name="foo2">foo value 2</field>
>   </doc>
>   <doc>...</doc>
> </add>
> But my problem is that I believe I need to preserve the xml markup at the paragraph and
sentence levels, so I was hoping to create a content field that could just contain the source
xml for the paragraph or sentence respectively.  There are reasons for this that I won't go
into -- a lot of granular work in this app, accessing pars and sens.
> Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointer)
would work great.  Still, I think Lucene can do this at the field level -- and I also can't
imagine that users who are indexing XML documents have to go through the trouble of stripping
all the markup before indexing.  Hopefully I am missing something basic?
> It would be great to be pointed in the right direction on this matter.
> I think I need something along this line:
> <add>
>   <doc>
>     <field name="foo1">value 1</field>
>     <field name="foo2">value 2</field>
>     ....
>     <field name="content"><an xml stream with embedded source markup></field>
>   </doc>
> </add>
> Maybe the overall question is: what is the best way to index XML content using SOLR?
Is all this tag stripping really necessary?

crazy/silly idea maybe... could you use dynamic fields, each containing a sentence, with a
reference to the paragraph it belongs to encoded in the field name?
e.g. (not sure if the syntax is correct):

<dynamicField name="s_*" type="string" />

Then when you create your document you can define
  <field name="s_1_p1">{Sentence #1, Para#1}</field>
  <field name="s_2_p1">{Sentence #2, Para#1}</field>
  <field name="s_3_p1">{Sentence #3, Para#1}</field>
  <field name="s_1_p2">{Sentence #1, Para#2}</field>
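
A rough sketch of how those fields could be generated from the source file, in Python
(my choice for illustration, nothing in the thread prescribes it). It assumes the book is
well-formed XML with <p> and <s> elements as described in the question; the function name
book_to_solr_doc and the "id" field are made up for the example, and the inline markup
inside each sentence is dropped here, not preserved.

```python
import xml.etree.ElementTree as ET


def book_to_solr_doc(book_xml, doc_id):
    """Build one Solr <add><doc> with a field per sentence, named s_<sentence>_p<paragraph>."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    # Assumption: the schema has an "id" field acting as the unique key.
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id

    for p_num, para in enumerate(book.iter("p"), start=1):
        for s_num, sent in enumerate(para.iter("s"), start=1):
            field = ET.SubElement(doc, "field", name=f"s_{s_num}_p{p_num}")
            # itertext() keeps only the text and drops any inline tags.
            field.text = "".join(sent.itertext()).strip()

    return ET.tostring(add, encoding="unicode")


sample = (
    "<book><p><s>First one.</s><s>Second <i>one</i>.</s></p>"
    "<p><s>Third.</s></p></book>"
)
solr_xml = book_to_solr_doc(sample, "book-1")
print(solr_xml)
```

Note that every sentence ends up as a separate field of the same document, so the number
of distinct field names grows with the length of the book.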

I have no idea how scalable that would be. 
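
For comparison, here is a sketch of the layout from the original question: one Solr
document per sentence, with the source markup kept in a "content" field. The markup has to
be escaped (or wrapped in CDATA) inside <field>, otherwise the update XML would contain
nested elements instead of text. Again Python for illustration only; sentence_docs, the
"id" values and the "content" field name are assumptions, and how the field should be
analyzed for searching is left open.

```python
import xml.etree.ElementTree as ET


def sentence_docs(book_xml):
    """Build a Solr <add> with one <doc> per sentence, keeping the <s> markup as escaped text."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")

    for p_num, para in enumerate(book.iter("p"), start=1):
        for s_num, sent in enumerate(para.iter("s"), start=1):
            # tostring() would also emit text that follows </s>, so hide the tail temporarily.
            tail, sent.tail = sent.tail, None
            markup = ET.tostring(sent, encoding="unicode")
            sent.tail = tail

            doc = ET.SubElement(add, "doc")
            # Assumption: "id" is the unique key and "content" is a stored field.
            for name, value in (("id", f"p{p_num}_s{s_num}"), ("content", markup)):
                field = ET.SubElement(doc, "field", name=name)
                # Assigning to .text makes ElementTree escape <, > and & on output.
                field.text = value

    return ET.tostring(add, encoding="unicode")


sample = (
    "<book><p><s>First one.</s> <s>Second <i>one</i>.</s></p>"
    "<p><s>Third.</s></p></book>"
)
update_xml = sentence_docs(sample)
print(update_xml)
```

Parsing the field value back gives the original <s>...</s> fragment, so the application
can still work on the markup after retrieving a hit.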
