lucene-java-user mailing list archives

From Zsolt Czinkos <>
Subject Re: SGML Indexing
Date Thu, 02 Dec 2004 11:43:32 GMT

We converted our SGML documents to XML with sgml2xml (part of the
Fedora/Red Hat Linux distribution) and processed the XML rather than the
SGML. We also use a very simple "SAX-like SGML parser" (suitable for our
SGML documents; not a general-purpose SGML parser) to extract data from
SGML, but we use XML for indexing.
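
For illustration, here is a minimal sketch of how the converted XML could be
split into one record per article with the standard Java SAX API. It is not
our actual code. The element name ARTICLE, the class ArticleSplitter, and the
field names "path" and "position" are assumptions made up for the example;
TITLE, AUTHOR and DATA come from the original question. Each returned map
would become one Lucene Document, with "path" and "position" kept as stored
fields so a hit can be traced back to its article. The path is stored
relative to a base directory, as Erik suggests in the message quoted below.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ArticleSplitter {

    /** Collects one field map per ARTICLE element (the element name is an assumption). */
    static class ArticleHandler extends DefaultHandler {
        final List<Map<String, String>> articles = new ArrayList<>();
        private final String relativePath;
        private final StringBuilder text = new StringBuilder();
        private Map<String, String> current; // null while outside an ARTICLE

        ArticleHandler(String relativePath) {
            this.relativePath = relativePath;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("ARTICLE")) {
                current = new LinkedHashMap<>();
                // These two values are what link an index entry back to its source.
                current.put("path", relativePath);
                current.put("position", String.valueOf(articles.size()));
            }
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // text outside any article is ignored
            }
            if (qName.equals("ARTICLE")) {
                articles.add(current);
                current = null;
            } else {
                // Flat children only (TITLE, AUTHOR, DATA); nested markup is not handled.
                current.put(qName.toLowerCase(Locale.ROOT), text.toString().trim());
            }
            text.setLength(0);
        }
    }

    /** Parses one XML file and returns one field map per article it contains. */
    public static List<Map<String, String>> split(Path baseDir, Path file, InputStream xml)
            throws Exception {
        // Store the path relative to baseDir so the base directory can move
        // without re-indexing.
        String relative = baseDir.relativize(file).toString().replace('\\', '/');
        ArticleHandler handler = new ArticleHandler(relative);
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.articles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DOC>"
                + "<ARTICLE><TITLE>First</TITLE><AUTHOR>A. Writer</AUTHOR><DATA>Body one</DATA></ARTICLE>"
                + "<ARTICLE><TITLE>Second</TITLE><AUTHOR>B. Writer</AUTHOR><DATA>Body two</DATA></ARTICLE>"
                + "</DOC>";
        Path base = Paths.get("/data/sgml");                // hypothetical base directory
        Path file = Paths.get("/data/sgml/1998/issue1.xml"); // hypothetical source file
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (Map<String, String> article : split(base, file, in)) {
            System.out.println(article); // one line per article, i.e. per future index document
        }
    }
}
```

If the articles carry their own unique identifier, storing that instead of
the position would survive articles being reordered within a file.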



On Thu, Dec 02, 2004 at 05:59:35AM -0500, Erik Hatcher wrote:
> On Dec 1, 2004, at 3:43 PM, DES wrote:
> >I've got to index some SGML documents and need help or some 
> >experience with it.
> >The documents contain a number of different articles with the same 
> >structure (TITLE, AUTHOR, DATA, etc.), but I need to index each article 
> >as a separate document in my Lucene index. Is there a reader for 
> >SGML? I can create separate documents, but how can I relate these 
> >index documents to my articles within the SGML files?
> I don't have SGML parsing experience, so I can't help with that detail. 
>  However, I'm doing something quite similar in my current project (once 
> it goes live I'll send pointers to it).  I've got a directory of XML 
> files.  Each of these XML files has sections.  These sections are what 
> the user wants to search on, so each one of those corresponds to a 
> Lucene Document.  There are a lot more "documents" indexed than there 
> are XML files.  These sections have a unique identifier.  I store the 
> filename and section identifier with each document, allowing me to link 
> back to the original precisely.
> You'll need to identify the unique values in your data that allow you 
> to relate back to the original files - perhaps using filename as one 
> stored field.  When storing filenames, be sure to account for the 
> situation where the original source moves location - I store relative 
> paths from a base directory, so that the base directory can move 
> without having to re-index to stay in sync.
> 	Erik
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
