lucene-dev mailing list archives

From "Ogren, Philip V." <>
Subject Indexing XML Demo
Date Sat, 01 Dec 2001 19:05:03 GMT

Well, I got one response, so I went ahead with it.  This is a very simple
demo that shows how to use XSL transforms and SAX parsing to index XML
documents with Lucene.  The code has been compiled and tested.
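
For readers without the attachment, the SAX side might look roughly like the
sketch below. It assumes the <document>/<field> format described in the quoted
message further down. The names FieldHandlerSketch, FieldSpec and parse are
hypothetical, not taken from the demo jar, and the sketch only collects the
field descriptions; in the real demo each one would become an
org.apache.lucene.document.Field.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one FieldSpec per <field> element.
class FieldHandlerSketch extends DefaultHandler {

    // Hypothetical holder for a parsed <field>; not a Lucene class.
    static final class FieldSpec {
        final String name;
        final String value;
        final boolean store;
        final boolean index;
        final boolean token;

        FieldSpec(String name, String value, boolean store, boolean index, boolean token) {
            this.name = name;
            this.value = value;
            this.store = store;
            this.index = index;
            this.token = token;
        }
    }

    private final List<FieldSpec> fields = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inField;
    private String name;
    private boolean store;
    private boolean index;
    private boolean token;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("field".equals(qName)) {
            inField = true;
            name = atts.getValue("name");
            // A missing attribute is treated as false (an assumption of this sketch).
            store = Boolean.parseBoolean(atts.getValue("store"));
            index = Boolean.parseBoolean(atts.getValue("index"));
            token = Boolean.parseBoolean(atts.getValue("token"));
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        if (inField) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("field".equals(qName)) {
            // This is where a Lucene Field would be created and added to a Document.
            fields.add(new FieldSpec(name, text.toString(), store, index, token));
            inField = false;
        }
    }

    List<FieldSpec> getFields() {
        return fields;
    }

    // Convenience entry point: parse an XML string and return the collected fields.
    static List<FieldSpec> parse(String xml) {
        FieldHandlerSketch handler = new FieldHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse index XML", e);
        }
        return handler.getFields();
    }
}
```

Calling FieldHandlerSketch.parse(...) on the two-field sample from the quoted
message would return two FieldSpec entries, one per <field> tag.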

This demo takes several seconds to run, but that is due to the cost of
instantiating objects and opening and closing the index.  I have used this
technique to index about 1M documents and have found it reasonably fast
(~120 seconds per 1000 documents).  I know that statistic is completely
meaningless, but heh...  I think the small amount of code it took to put
this demo together is a powerful testimony to SAX technology - I'm pretty
high on XSL right now :)  The main thing, as I mentioned below, is that I do
not need to change any Java code if I decide to change the way I index my
XML documents.
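
To illustrate the point about changing the style sheet rather than the Java
code, here is a rough sketch of the XSLT step using the standard
javax.xml.transform API. The <article> source format, the STYLESHEET string
and the names StylesheetSketch and toIndexXml are invented for illustration
and are not taken from the demo jar; only the <document>/<field> target
format comes from the quoted message below.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: turns application-specific XML into <document>/<field> XML.
class StylesheetSketch {

    // Hypothetical style sheet for an invented <article> format.
    // Changing what gets indexed means editing this text, not the Java code.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/article\">"
        + "<document>"
        + "<field name=\"title\" store=\"true\" index=\"true\" token=\"true\">"
        + "<xsl:value-of select=\"title\"/>"
        + "</field>"
        + "<field name=\"body\" store=\"false\" index=\"true\" token=\"true\">"
        + "<xsl:value-of select=\"body\"/>"
        + "</field>"
        + "</document>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given style sheet to the source XML and returns the result as a string.
    static String toIndexXml(String sourceXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            // Leave out the <?xml ...?> header so the output is just the <document> element.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(sourceXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }
}
```

The string returned by toIndexXml can then be handed to whatever SAX handler
builds the Lucene document, so the two steps stay independent.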

Let me know if you find this useful or if for some reason it doesn't work.

Directions: extract the attached jar file and open readme.txt

Philip Ogren

-----Original Message-----
From: Yiyi Sun []
Sent: Thursday, November 29, 2001 10:46 AM
To: Lucene Developers List
Subject: Re: parsing XML


Thanks a lot. I would like to have your XML package and



--- "Ogren, Philip V." <> wrote:
> I didn't pore through the archive to make sure no one had done this yet
> but...
>
> I have a generic way of indexing XML that I think is really useful.
> Basically, I implement the DefaultHandler (in SAX) to handle XML
> documents that look something like this:
>
> <document>
> 	<field name="myfield1" store="true" index="true" token="true">a small field</field>
> 	<field name="myfield2" store="false" index="true" token="true">a large field</field>
> </document>
>
> I haven't actually written a DTD or schema because I haven't needed one
> yet.*  I create an org.apache.lucene.document.Field for each 'field' tag
> that is processed.  The way I get an XML document that conforms to this
> very simplistic schema is through XSLT.  You simply create a style sheet
> that transforms your specific XML document into XML that conforms to the
> above tags.  It's proven very useful on our project because changing the
> way an XML document is indexed requires no change in the code - I simply
> change my style sheet and reindex.
>
> I would be willing to cut a version of this code that would be suitable
> for a demonstration - along with a demo - if there is any interest.
>
> Regards,
> Philip Ogren
>
> *I originally had a 'datefield' tag as well but I found the DateField
> class to be useless for my application as it doesn't handle dates before
> 1970.
> > Philip V. Ogren
> > Medical Information Resources
> > Mayo Clinic Rochester
> > (507) 538-0167
> >
> > 
