lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: SGML Indexing
Date Thu, 02 Dec 2004 10:59:35 GMT
On Dec 1, 2004, at 3:43 PM, DES wrote:
> i've got to index some SGML documents and need help or some 
> expiriences with it.
> Documents contain a number of different articles with the same 
> structure (TITLE, AUTHOR, DATA etc.), but i need to index each article 
> as a different document in my lucene index. Is there some reader for 
> SGML? I can create different documents, but how can i relate this 
> index-documents to my articles within SGML files?

I don't have SGML parsing experience, so I can't help with that detail. 
  However, I'm doing something quite similar in my current project (once 
it goes live I'll send pointers to it).  I've got a directory of XML 
files.  Each of these XML files has sections.  These sections are what 
the user wants to search on, so each one of those corresponds to a 
Lucene Document.  There are a lot more "documents" indexed than there 
are XML files.  These sections have a unique identifier.  I store the 
filename and section identifier with each document, allowing me to link 
back to the original precisely.

You'll need to identify the unique values in your data that allow you 
to relate back to the original files - perhaps using filename as one 
stored field.  When storing filenames, be sure to account for the 
situation where the original source moves location - I store relative 
paths from a base directory, so that the base directory can move 
without having to re-index to stay in sync.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message