lucene-solr-user mailing list archives

From Norberto Meijome <free...@meijome.net>
Subject Re: What is the best way to index xml data preserving the mark up?
Date Thu, 08 Nov 2007 06:20:11 GMT
On Wed, 7 Nov 2007 20:18:25 -0800 (PST)
David Neubert <devmecrazy@yahoo.com> wrote:

> I am sure this is a 101 question, but I am a bit confused about indexing XML data using SOLR.
> 
> I have rich XML content (books) that needs to be searched at granular levels (specifically
> paragraph and sentence levels, very accurately, no approximations). My source text has exact
> <p></p> and <s></s> tags for this purpose. I have built this app
> in previous versions (using other search engines) by indexing the text twice: (1) where every
> paragraph was a virtual document and (2) where every sentence was a virtual document, both
> extracted from the source file (which was a single XML file for the entire book). I have of
> course thought about using an XML engine such as eXist or Xindice, but I prefer the stability,
> user base, and performance that Lucene/SOLR seems to have, and there is also a large body
> of text that consists of regular documents rather than well-formed XML.
> 
> I am brand new to SOLR (one day) and, at a basic level, understand SOLR's nice, simple XML
> scheme for adding documents:
> 
> <add>
>   <doc>
>     <field name="foo1">foo value 1</field>
>     <field name="foo2">foo value 2</field>
>   </doc>
>   <doc>...</doc>
> </add>
> 
> But my problem is that I believe I need to preserve the XML markup at the paragraph and
> sentence levels, so I was hoping to create a content field that could just contain the source
> XML for the paragraph or sentence respectively. There are reasons for this that I won't go
> into; there is a lot of granular work in this app, accessing paragraphs and sentences.
> 
> Obviously an XML mechanism that could leverage the XML structure (via XPath or XPointers)
> would work great. Still, I think Lucene can do this at the field level, and I can't
> imagine that users who are indexing XML documents have to go through the trouble of stripping
> all the markup before indexing. Hopefully I am missing something basic?
> 
> It would be great to be pointed in the right direction on this matter.
> 
> I think I need something along this line:
> 
> <add>
>   <doc>
>     <field name="foo1">value 1</field>
>     <field name="foo2">value 2</field>
>     ....
>     <field name="content"><an xml stream with embedded source markup></field>
>   </doc>
> </add>
> 
> Maybe the overall question is: what is the best way to index XML content using SOLR?
> Is all this tag stripping really necessary?

Crazy/silly idea maybe... could you use dynamic fields, each containing a sentence, with a
reference to the paragraph it belongs to in the field name?
E.g. (not sure if the syntax is correct):

<dynamicField name="s_*" type="string" />

Then when you create your document you can define:
<doc>
  <field name="s_1_p1">{Sentence #1, Para#1}</field>
  <field name="s_2_p1">{Sentence #2, Para#1}</field>
  <field name="s_3_p1">{Sentence #3, Para#1}</field>
  <field name="s_1_p2">{Sentence #1, Para#2}</field>
[...]
</doc>

I have no idea how scalable that would be. 
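
For what it's worth, here is a minimal sketch of how such a document could be generated from the source file. It is only an illustration under several assumptions: the book is well-formed XML with <p> and <s> elements as described above, the function names (element_source, build_add_xml) and the "id" field are hypothetical, and the s_<sentence>_p<paragraph> naming just follows the example above. Each field holds the sentence's own source XML; ElementTree escapes it when the <add> message is serialized, so the markup travels as text inside the field.

```python
import copy
import xml.etree.ElementTree as ET


def element_source(elem):
    """Return the source XML of one element, without the text that follows it."""
    clone = copy.copy(elem)  # shallow copy; the children are shared and left unmodified
    clone.tail = None        # drop trailing text so only the element itself is serialized
    return ET.tostring(clone, encoding="unicode")


def build_add_xml(book_xml, doc_id):
    """Build an <add> message with one s_<sentence>_p<paragraph> field per sentence."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    # Hypothetical unique key field; adjust to whatever your schema defines.
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id

    for p_num, para in enumerate(book.iter("p"), start=1):
        for s_num, sent in enumerate(para.iter("s"), start=1):
            field = ET.SubElement(doc, "field", name=f"s_{s_num}_p{p_num}")
            # The markup is stored as text; it is escaped on serialization below.
            field.text = element_source(sent)

    return ET.tostring(add, encoding="unicode")
```

Note that in the stock example schema the "string" type is not tokenized, so if you want to search for words inside a sentence you would probably need a tokenized text type for the dynamic field, and likely an analyzer that deals with the embedded tags. I have not tried this against a real index.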
cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

Immediate success shouldn't be necessary as a motivation to do the right thing.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers
makes you go blind. Writing them is worse. You have been Warned.
