lucene-solr-user mailing list archives

From "Binkley, Peter" <Peter.Bink...@ualberta.ca>
Subject RE: What is the best way to index xml data preserving the mark up?
Date Thu, 08 Nov 2007 16:23:46 GMT
I've used eXist for this kind of thing and had good experiences, once I
got a grip on XQuery (which is definitely worth learning). But I've only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's.

It may be possible to use Lucene's new payloads to do this kind of
thing (at least, storing XPath information is one of the proposed
uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter

-----Original Message-----
From: David Neubert [mailto:devmecrazy@yahoo.com] 
Sent: Wednesday, November 07, 2007 9:52 PM
To: solr-user@lucene.apache.org
Subject: Re: What is the best way to index xml data preserving the mark
up?

Thanks Walter -- 

I am aware of MarkLogic -- and agree -- but I have a very low budget on
licensed software in this case (near 0) -- 

have you used eXist or Xindice?

Dave

----- Original Message ----
From: Walter Underwood <wunderwood@netflix.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 7, 2007 11:37:38 PM
Subject: Re: What is the best way to index xml data preserving the mark
up?

If you really, really need to preserve the XML structure, you'll be
doing a LOT of work to make Solr do that. It might be cheaper to start
with software that already does that. I recommend MarkLogic -- I know
the principals there, and it is some seriously fine software. Not free
or open, but very, very good.

If your problem can be expressed in a flat field model, then your
problem is mapping your document model into Solr. You might be able to
use structured field names to represent the XML context, but that is
just a guess.
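
The "structured field names" idea above could be sketched roughly as
follows. This is only an illustration, not something from the thread:
the helper name flatten, the "_" separator, and the sample document are
all assumptions, and text that sits directly inside an element that
also has child elements is ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, sep="_"):
    """Return (field_name, text) pairs for every leaf element.

    The field name is the chain of element names from the root down to
    the leaf, joined with `sep` (a hypothetical naming scheme).
    """
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, path):
        path = path + [elem.tag]
        children = list(elem)
        if not children:
            # Leaf element: keep its text if it is not just whitespace.
            text = (elem.text or "").strip()
            if text:
                pairs.append((sep.join(path), text))
        for child in children:
            walk(child, path)

    walk(root, [])
    return pairs


if __name__ == "__main__":
    sample = (
        "<book><title>Moby Dick</title>"
        "<chapter><p><s>Call me Ishmael.</s><s>Some years ago.</s></p></chapter>"
        "</book>"
    )
    for name, text in flatten(sample):
        print(name, "=", text)
```

A field name that repeats (two sentences in the same paragraph, say)
would have to be declared multi-valued in the Solr schema, and the
position of each sentence within its paragraph is lost, which is part
of why this stays "just a guess".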

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like
Lucene).

wunder

On 11/7/07 8:18 PM, "David Neubert" <devmecrazy@yahoo.com> wrote:

> I am sure this is a 101 question, but I am a bit confused about
> indexing XML data using SOLR.
> 
> I have rich XML content (books) that needs to be searched at granular
> levels (specifically paragraph and sentence levels, very accurately,
> no approximations).  My source text has exact <p></p> and <s></s>
> tags for this purpose.  I have built this app in previous versions
> (using other search engines) by indexing the text twice: (1) where
> every paragraph was a virtual document and (2) where every sentence
> was a virtual document -- both extracted from the source file (which
> was a single XML file for the entire book).  I have of course thought
> about using an XML engine such as eXist or Xindice, but I prefer the
> stability, user base, and performance that Lucene/SOLR seems to have,
> and there is also a large body of text that consists of regular
> documents rather than well-formed XML.
> 
> I am brand new to SOLR (one day) and at a basic level understand
> SOLR's nice simple XML scheme to add documents:
> 
> <add>
>   <doc>
>     <field name="foo1">foo value 1</field>
>     <field name="foo2">foo value 2</field>
>   </doc>
>   <doc>...</doc>
> </add>
> 
> But my problem is that I believe I need to preserve the XML markup at
> the paragraph and sentence levels, so I was hoping to create a content
> field that could just contain the source XML for the paragraph or
> sentence respectively.  There are reasons for this that I won't go
> into -- a lot of granular work in this app, accessing pars and sens.
> 
> Obviously an XML mechanism that could leverage the XML structure (via
> XPath or XPointers) would work great.  Still, I think Lucene can do
> this in a field-level way -- and I also can't imagine that users who
> are indexing XML documents have to go through the trouble of stripping
> all the markup before indexing.  Hopefully I am missing something
> basic?
> 
> It would be great to be pointed in the right direction on this matter.
> 
> I think I need something along this line:
> 
> <add>
>   <doc>
>     <field name="foo1">value 1</field>
>     <field name="foo2">value 2</field>
>     ....
>     <field name="content"><an xml stream with embedded source markup></field>
>   </doc>
> </add>
> 
> Maybe the overall question is: what is the best way to index XML
> content using SOLR?  Is all this tag stripping really necessary?
> 
> Thanks for any help,
> 
> Dave
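
For reference, the approach Dave describes (one virtual document per
paragraph and one per sentence, each carrying its source markup in a
content field) could be sketched as below. This is a minimal sketch
under stated assumptions, not a recommendation from anyone in the
thread: the function names and the field names "id", "level", "text"
and "content" are hypothetical, and the markup is simply XML-escaped so
that it travels as an ordinary string value inside the <add> message.

```python
import xml.etree.ElementTree as ET


def markup_of(elem):
    """Serialize one element without the text that follows its end tag."""
    tail, elem.tail = elem.tail, None
    try:
        return ET.tostring(elem, encoding="unicode")
    finally:
        elem.tail = tail


def build_add_message(book_xml):
    """Build a Solr <add> message: one <doc> per <p>, then one per <s>."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")
    for tag, level in (("p", "paragraph"), ("s", "sentence")):
        for n, elem in enumerate(book.iter(tag), start=1):
            doc = ET.SubElement(add, "doc")
            values = (
                ("id", f"{level}-{n}"),              # hypothetical id scheme
                ("level", level),                    # hypothetical field
                ("text", "".join(elem.itertext())),  # markup stripped
                ("content", markup_of(elem)),        # markup preserved
            )
            for name, value in values:
                field = ET.SubElement(doc, "field", name=name)
                # Assigning to .text lets the serializer escape < > &.
                field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = "<book><p><s>Call me Ishmael.</s> <s>Some years ago.</s></p></book>"
    print(build_add_message(sample))
```

Whether the tags inside "content" are ignored or tokenized at search
time depends on how that field is analyzed, which this sketch does not
address; the separate "text" field is one way to sidestep the question.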